Preface


THIS BOOK HAS BEEN DEVELOPED TO PROVIDE THE BASIS OF AN 
introductory course in probability and statistics for the college and 
university student. It includes material which has been tried out in 
many classes and by several instructors for almost a decade. 

The instructor using the book as a text or the student interested 
in the subject will find that college algebra is a necessary and suffi- 
cient prerequisite for this course, which aims to teach modern but ele- 
mentary ideas, methods of reasoning, and methods of analysis funda- 
mental but not peculiar to any particular specialized field. Once the 
student has acquired a background of elementary methods, probability,
and frequency distributions, he can be taught some of the
simpler sampling statistics in common use today. Thus, he may 
learn their importance as well as their application. The serious stu- 
dent will find included in this book problems to provoke thought and 
provide practice in statistical methods and reasoning. 

Some colleges and universities offer statistics courses—often with 
graduate credit—in which the elementary concepts and methods are 
not assumed to be known and hence are taught during the first part 
of the course. It seems to me that one general course in probability 
and statistics, with emphasis on statistical reasoning and modern 
methods, helps to avoid useless duplication of instruction. It also 
leaves time in subsequent courses to do more advanced work in spe- 
cialized fields. Such an introductory course also is rapidly becoming 
a necessary part of a student's education even if he does not use sta- 
tistics directly in his specialized field. 

It is helpful to the students during the studies of sampling to pro- 
vide them with some mathematical models of populations so that they 
can obtain sampling experiences which—for a whole class—empirically 
verify for them the sampling distributions given in some of the tables 
which they will be using. It has been my experience that most stu- 
dents need this sort of empirical evidence before they really under- 
stand the nature and the use of sampling distributions. Numbers, 
and other symbols, written on plastic discs can be made to correspond
closely to normal, non-normal, and binomial populations which are
met in actual practice. These populations, if properly employed, will
help the student to understand the more common sampling distributions
rather well despite a lack of familiarity with mathematical
statistics. Some of the problems in this book assume that such populations
are available to the students.


It is desirable to have calcul 
students can learn what operati 
Solve some of their problems 
that the acquisition of routine 


t takes most of the equivalent of a 
p the student with the ideas and 


H. C. Fryer
Kansas State College
Manhattan, Kansas

December, 1953 
History and Introduction


1.1 HISTORY


The word “statistics,” the associated mathematical analyses, and 
the general process of statistical reasoning appear to have begun their 
evolution around the time of Aristotle. This evolution can be de- 
scribed in terms of the following four phases, some of which occurred 
simultaneously among different groups of persons: 


(1.11) An early, highly philosophical, study of “matters of state” 
which did little more for the statistical science used today than help 
to suggest its name. 

(1.12) A semi-numerical and strongly sociological stage typified 
by the mathematical and philosophical study of large groups of 
numerical measurements bearing on health, insurance, foreign and 
domestic trade, and political matters. 

(1.13) The development of the mathematical theory of probabil- 
ity starting in the sixteenth century with mathematical attacks on 
the various problems associated with games of chance. 

(1.14) The current phase, starting late in the nineteenth century,
during which phases (1.12) and (1.13) were combined, improved, and 
extended to produce a branch of mathematics which can handle a 
wide variety of problems pertaining to the drawing of valid and use- 
ful inferences from relatively small groups of numerical measure- 
ments. 


During Aristotle’s time interest developed in comparative descrip- 
tions of states. Aristotle is reported * to have written at least one 
hundred and fifty-eight descriptions of states, covering their histories, 
public administrations, arts, sciences, and religious practices. It was 
customary to refer to such compositions as treatises on “matters of 
state.” That apparently is an important part of the origin of the 


*Harald L. Westergaard (1942), Contributions to the History of Statistics, 
King. 
term “statistics,” although the name itself was coined many years
after Aristotle’s death.

For quite a long time after Aristotle a weak interest was main- 
tained in descriptions of states partly by the intellectuals who enjoyed 
that pastime and partly by the rulers of the various states through 
their natural desire to know how many subjects they ruled, and to 
ascertain the wealth within their realms. Hence it is probable that 
some sort of crude census taking was attempted. 

During the seventeenth and eighteenth centuries sufficient interest 
was generated in the study of the political, sociological, and economic 
features of states that societies developed for that purpose. In Germany
this line of intellectual effort caused the development of the
Staatenkunde, a name which appears to have led rather directly to 
the actual coining of the term “statistics.” However, the Germans 
remained content to pursue the philosophical aspects of “matters of 
state”; hence the Staatenkunde never did become either very mathematical
in character, or very useful. It merely typifies the last stages
of the purely philosophical phase of the development of the science 
of statistical analysis, and points out its socio-political ancestry. 

Another, and more fruitful, step in the evolution of the present-day 
type of statistical reasoning originated in England under the leader- 
ship of John Graunt. This was a semi-mathematical study of vital 
statistics, insurance, and economic statistics which came to be known
as “Political Arithmetic.” Epidemic diseases periodic ... measurement and
an objective means of in ... ments. Hence there was a natural interest
in ... deaths, in estimating the numbers of ..., in figures on agricultural
..., ... administering insurance against the economic situations created by
death and disability.

Public interest in specific mea ... sources was heightened by the constant
danger of war with a neighboring state, and by the advent of an industrial
revolution during the eighteenth century. It was ... gical, and political
probl ... increasingly important a ... becoming ... units, and methods for
summarizing groups of data. Their efforts to
apply mathematical analysis to such problems helped to lay the
foundation for the statistical methods now in use.

A third step in the evolution of statistical analysis and reasoning 
came in the development of the mathematical theory of probability, 
without which statistical reasoning could never have attained its 
present reliability and usefulness. Games of chance were especially 
popular among the well-to-do of the sixteenth and seventeenth cen- 
turies; and many problems involving probability were presented to 
the mathematicians of the day for solution. For example, an Italian 
nobleman asked Galileo to explain the following facts: If three dice
are thrown, the numbers 9 and 10 can each be obtained from six
different combinations of the numbers on the faces of the dice; but 
it has been found from experience that a sum of 10 appears more 
frequently than a sum of 9. Why so? By an enumeration of all the 
physically different ways that three dice can produce sums of 9 or 
10, Galileo was able to answer this question clearly and convinc- 
ingly. His answer appears to be the first published application of 
the theory of probability.* Other prominent mathematicians such 
as Pascal, Fermat, James and Daniel Bernoulli, de Moivre, Laplace, 
Gauss, Simpson, Lagrange, Hermite, and Legendre developed many 
important theorems and methods of attacking problems involving 
chance events, and they passed this information on for later use by 
mathematical statisticians. 

During the last quarter of the nineteenth century, Sir Francis 
Galton took the lead in the development of the ideas of regression 
and correlation when two (or more) measurements are made simul- 
taneously on each member of a group of objects. He appears to have 
built his ideas around problems in genetics. Karl Pearson and C. 
Spearman extended this theory and applied it to studies in the social 
sciences, especially psychology. Karl Pearson and others also had 


* The nature of Galileo’s solution is as follows. A sum of 9 can be obtained 
from any of the following combinations of numbers on three dice: 1, 2, 6; 1, 3, 
5; 1, 4, 4; 2, 2, 5; 2, 3, 4; or 3,3, 3. A sum of 10 is obtained from any of these: 
1, 3, 6; 1, 4, 5; 2, 2, 6; 2, 3, 5; 2, 4, 4; or 3, 3, 4. There are six different com- 
binations giving each of the sums 9 and 10; but the different combinations do 
not occur equally frequently. For example, the combination 3, 3, 3 can be 
thrown but one way whereas the combination 3, 3, 4 can occur on any of three 
different throws, and hence would tend to appear three times as often as 3, 3, 3. 
As a matter of fact, a 9 can be thrown in twenty-five different ways, a 10 in 
twenty-seven different ways, which is the reason that the 10 appears more 
frequently in games than the 9. 
begun to study the effects of sampling errors on conclusions drawn
from samples. 
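
Galileo's enumeration, described in the footnote above, can be checked by brute force. The following is only a minimal sketch in Python (a language which is not part of the original text); the function name `count_ways` is an illustrative assumption. It lists every ordered throw of three dice and counts those which produce a given sum.

```python
from itertools import product

def count_ways(total, dice=3, faces=6):
    """Count the ordered throws of `dice` dice whose faces add up to `total`."""
    throws = product(range(1, faces + 1), repeat=dice)
    return sum(1 for throw in throws if sum(throw) == total)

ways_9 = count_ways(9)    # ordered throws giving a sum of 9
ways_10 = count_ways(10)  # ordered throws giving a sum of 10
print(ways_9, ways_10)    # 25 27
```

Each of the 216 ordered throws is equally likely, which is why the counts of ordered throws, and not the six unordered combinations, determine the relative frequencies of the two sums.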


By the end of the nineteenth century the Staatenkunde had ceased 
to exist, and “Political Arithmetic” had died in name but had devel- 
oped into a science of statistical analysis, with emphasis on socio- 
logical and economic applications. The theory of mathematical 
probability had grown extensively as a branch of pure mathematics,
and also was beginning to be associated with applied statistics. Thus 
the groundwork was laid for the present phase in the evolution of 
statistical theory and methods. 

In 1908 William Sealy Gosset, who wrote under the pseudonym
“Student,” published an article in the journal Biometrika which was 
later to typify the opening of a new era in the statistical analysis and 
interpretation of sampling data. From 1899 until his death in 1937, 
Gosset worked for the brewing firm, Messrs. Guinness. His associations
with this firm led him into a variety of experiences and sug-
gested uses for statistical methods which are typical of several of the 
present-day applications of statistics. 

Messrs. Guinness were interested in barley, not just any barley, but
in those varieties, growing conditions, and practices which would
produce the best barley for breweries to use. These circumstances
brought Gosset into contact with agricultural experimentation aimed
at the improvement of crops and of agricultural practices. Moreover,
Messrs. Guinness did not wish to subsidize the raising of large crops
purely for the sake of scientific experimentation; they were a commercial
firm which wanted to show a profit from their enterprises.
That fact, plus the short ... made Gosset well aware ... a commercial
... which demand reliability with a ... The twentieth cen ...
based on that attitude toward the purposes of statistical analysis.

Unfortunately, the statistical ideas and procedures which Student 
introduced in 1908 did not become familiar to persons outside his own 
firm for nearly a decade, at which time R. A. Fisher and his col- 
leagues in England began to extend and to popularize the theory of 
small samples and its applications. The theory of statistics was
developed extensively by Jerzy Neyman and Karl Pearson's son, 
Egon S. Pearson. They placed special emphasis on rigor in statisti- 
cal reasoning and led the way by publishing many papers in this field. 
Many others have followed their lead since their papers began to 
appear. The results of this research are being applied in many fields,
such as biology, the physical sciences, industry, economics, sociology,
medicine, education, and psychology. 


1.2 SOME OF THE PURPOSES OF STATISTICAL
REASONING 


Early in his history man displayed a desire to take numerical 
measurements of the various phenomena involving himself and his 
environment. At first, those measures probably consisted of simple 
counts, or of crude measures of weight, volume, length, and area. At 
present many instruments are available for the precise measurement 
of those features of man's self and environment which interest him. 
He constantly is taking groups of numerical measurements because 
such a procedure can furnish a relatively precise and standard means 
of obtaining the information desired, of using it efficiently, and of 
transmitting that information to others. The general purpose of
statistical analysis is to assist in the collection and the interpretation
of sets of numerical measurements which supposedly have been
taken for some useful purpose. 

Once it is decided that a particular phenomenon should be measured
numerically, one of two general classes of data is then obtained.
It may be that it was both possible and practicable to secure
every measurement of that particular kind which exists or could be
obtained under the particular circumstances. Such a complete record
is one type of statistical population of numerical measurements. An
example is the record of the ages of all the legal residents of the state
of Kansas on April 1, 1950, as contained in the official United States
Census for that date. Another example is a list of the I.Q.'s of all the
students entering a particular university in a given year.
However, it is more commonly true that it is impossible, or unwise,
to collect a whole population of numerical measurements. In
that event we obtain but a portion of a population for actual analysis,
and attempt to draw from it useful conclusions about the population
which was merely sampled. If the sample is to be useful it must
be adequately representative of the population; that is, it should 
faithfully reflect the important features of the population. 

In the event that the whole population of data is available for
analysis, the purpose of statistical analysis is to reduce what is a
relatively large bulk of numbers to a comprehensible form by means
of graphs and tables and/or by calculating a few figures which contain
most of the important information theoretically available in the
original mass of data. For example, the ACE scores at the beginning
of Chapter 2 are numerical measurements which the college took in
the belief that they would be of value to the student and to the
school, perhaps by helping to determine what profession the student
should prepare to enter. Obviously those data are so bulky that they
demand some sort of condensation.

It is worth noting at this point that even though the necessity to
analyze whole populations of data is a rare circumstance, it is not
logical to study the statistical analysis of samples without some adequate
knowledge of the statistical features of the populations from
which the samples are taken. Fortunately a considerable amount of
useful statistical analysis can be learned and appreciated without
studying more than two general types of populations.

Whenever we attempt to base conclusions concerning a statistical
population of numerical measurements upon relatively few observations
(a sample) from that population, we face two important general
questions. (a) How shall the sample be taken so as to maximize
its chance of being representative of that population? (b)
Having obtained some numerical observations from the population
with question a in mind, how do we draw valid conclusions from the
... Consider the following sampling problem. ... commission is considering
the purchase ... highway construction, and that two companies ... products
for purchase. The commission ... the two cements ... cement. It will be
su ... has been decided that ten of the standard laboratory specimens
will be tested from each company’s product. The test of each specimen
will produce an “observation” from the population of all such tensile
strengths possible from that company’s cement. All told, there will be
twenty samples taken, ten from each company.

Would it be satisfactory to inform each company of the plans for 
testing the cements and ask each company to provide ten specimens 
of concrete for testing? Or, would it be better to go into the open
market and purchase a sack of each company’s cement from each 
of ten stores and have a laboratory uniformly make up the ten 
testing specimens? Rather clearly the latter method would be much
more likely to produce specimens which were representative of the 
respective strengths of those concretes at seven days of age. 

One of the purposes of statistical theory is to devise methods for 
taking samples in such a way that they do yield essentially the same 
information as is contained in the population which was sampled. 
For the most part, that phase of statistics lies beyond the scope of
this book; hence no attempt will be made to do more than to remind 
the reader of a few commonsense considerations from time to time. 

Suppose now that the ten sample specimens of concrete have been 
tested for tensile strength at seven days of age, with the following 


results:

Cement    Seven-Day Tensile Strength (lb./sq. in.) of Concrete

No. 1     425, 410, 425, 460, 430, 445, 445, 415, 450, and 440
No. 2     420, 450, 405, 400, 400, 415, 435, 425, 400, and 430


How do we decide from such evidence whether one concrete will, as 
a rule, excel the other in tensile strength; or if either or both conform 
to pre-assigned standards for such building materials? Casual ob- 
servation indicates that cement No. 1 tends to produce greater ten- 
sile strength in its concrete than No. 2; but there are several speci- 
mens from cement No. 2 that produced greater strength than certain 
of the specimens from No. 1 cement. For example, five of the No. 1 
specimens had tensile strengths at or below 430 pounds per square 
inch, and two of the specimens of cement No. 2 had strengths greater 
than 430 pounds per square inch. Without doubt, then, some batches 
of cement No. 2 are better than some batches of cement No. 1. Such 
a situation is met quite frequently in sampling studies; only rarely 
do progressive improvements in methods or materials come on such 
a large scale that all previous methods or materials are excelled without
exception. What is needed, and now available to a highly useful
degree, is a method of reasoning which enables us to induce from
relatively few samples useful information regarding the population
sampled.
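
The casual observations made above about the two cements can be reproduced with a few lines of arithmetic. The sketch below is written in Python, which is an assumption of this illustration and not a tool used in the text; the variable names are likewise illustrative. It is a description of the twenty observations only, and is not a method of deciding whether one cement really excels the other.

```python
# Seven-day tensile strengths (lb./sq. in.) copied from the table above.
cement_1 = [425, 410, 425, 460, 430, 445, 445, 415, 450, 440]
cement_2 = [420, 450, 405, 400, 400, 415, 435, 425, 400, 430]

# Arithmetic mean of each sample of ten specimens.
mean_1 = sum(cement_1) / len(cement_1)
mean_2 = sum(cement_2) / len(cement_2)

# Overlap between the samples: No. 1 specimens at or below 430,
# and No. 2 specimens above 430.
low_1 = sum(1 for strength in cement_1 if strength <= 430)
high_2 = sum(1 for strength in cement_2 if strength > 430)

print(mean_1, mean_2)  # 434.5 418.0
print(low_1, high_2)   # 5 2
```

The two counts agree with the five and two specimens mentioned in the discussion, and they show why a comparison of the sample means alone cannot settle the question.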

Inductive reasoning based on evidence obtained from samples 
necessarily runs some risk of error; but as long as the extent of this 
risk can be measured, the process offers real hope for useful application.
Much of the recent research in mathematical statistics has
been devoted to the development of methods of reasoning based on
sampling observations.

The reader should not feel from the preceding remarks that sampling
is useful only in scientific research, because everyone is constantly
being confronted with sampling studies of one sort or another.
Radio advertising is quite full of alleged sampling investigations
during which various products presumably have been tested and
shown to be superior. Life insurance premiums are based on samples
of mortality rates among insurable persons. Public opinion polls,
economic polls, and the like, often reported in the newspapers, are
attempts to reason from a sample to conclusions about a whole
population of possible responses to one or more questions. Persons
who have visited other parts of the world return and, upon the basis
of relatively small samples, attempt to say how whole nations or
societies are reacting to certain world events. The reader undoubtedly
can think of many other examples of sampling followed
by more or less valid applications of either inductive or deductive
reasoning, or what might be better described in this instance as
statistical inference.

In closing these intr ... student that, as in ma ... t him from reading
the rest of the book for himself, and from takin ...

CHAPTER 2 


The Summarization of Sets
of Data Involving One Type
of Measurement


Whenever a statistical investigation is to be made, two initial steps 
must be taken: (a) A group of objects (persons, plants, bolts, or any- 
thing capable of being measured) is specified as the subject to be 
studied. (b) A decision is made regarding the feature of these 
objects that is to be measured numerically or by some qualitative
designation. Such a set of measurements is called a population if it
includes every member of the group defined in a. For example,
suppose that an economist proposes to study the net cash incomes
of beef-cattle ranchers in Kansas during the decade from January
1, 1944, to January 1, 1954. It would be necessary first to define 
the group of ranchers to be included in this study. How many beef 
animals must he raise? Must the raising of beef cattle be his major 
source of income according to some standard? Are absentee owners 
included? There are many other matters which would have to be
considered. When a specific group of Kansas ranchers has been
defined, part a above has been completed.

Next it is necessary to decide upon the specific meaning of the 
term, net cash income. Is the measurement to be on a per-animal 
basis, or the total for the ranch regardless of its size? Is any ad- 
justment to be made for inflation, cost-of-living indexes, and the 
like? When net cash income has been defined specifically, part b 
listed above has been completed, and the population is defined. 

In some situations it is feasible to obtain every possible one of 
the measurements in a population, as would be the case if every 
beef-cattle rancher in the group discussed above were to be inter- 
viewed and his net cash income determined according to the defini- 
tion adopted by the investigator. Under these circumstances, the 
purpose of statistical analysis is to summarize the information in the 
data as clearly and as concisely as is possible. A statistical description
of a population will be found to be important also when the
population is to be studied by means of samples rather than in its 
entirety. There are various widely used methods of accomplishing 
such a purpose. The choice of a method depends upon what is to be 
learned from the data, and upon the statistical characteristics of the 
population which is being summarized. 

The need for statistical descriptions and summaries is pointed out 
rather specifically by means of the data in Table 2.01. It contains 
the 1290 ACE scores made by students entering Kansas State College 
for the first time in 1947. An ACE score is intended to measure
certain features of a student's intellect and aptitudes which are
thought to be related to his success in pursuing one of the various
possible college curricula. If so, ACE scores should help the student
and the staff to do a better job of fitting the students’ abilities and 
interests to the facilities which the college has to offer. 

The reader is already generally familiar with the term average as 
some kind of usefully typical number which partially replaces a 
whole group of numbers; but he may be less familiar with the fact 
that there are several averages in use. It should be intuitively 
obvious that no single number, such as an average, can be expected to
summarize adequately the set of data in Table 2.01. ... Some measure
of the variability exhibited by these ACE ... numerical measurements ...
will be introduced by mean ... convenience. Thereafter, reference again
will be made to Table 2.01.


TABLE 2.01 


1290 ACE TEST SCORES MADE BY STUDENTS ENTERING KANSAS STATE
COLLEGE DURING 1947
(Data furnished by the Counseling Bureau of Kansas State College.)


118 150 100 91 109 87 93 120 92 77 80 101 140 112 
67 97 114 60 111 102. 107 117 77 129 56 105 88 113 
1297 111 107 76 55 102 90 78 105 87 113 138 108 68 134 
76 111 85 98 97 78 79 110 143 62 110 94 135 156 
70 123 69 58 79 123 121 117 84 04 72 97 115 133 


78 130 85 87 66 103 94 111 159 85 115 108 111 41 
120 89 183 78 99 108 55 91 80 103 86 58 99 132 109 
105 109 95 88 153 70 76 109 121 102 119 82 103 45 50 

44 100 93 71 45 78 99 124 99 108 72 87 122 95 

52 94 121 63 108 94 102 103 69 114 5 4l 28 65 107 

99 33 115 121 146 89 124 118 98 131 95 100 97 89 


75 90 113 97 64 100 101 109 100 91 3 128 118 72 
113 166 118 125 82 104 87 113 95 112 105 73 101 121 70 
72 140 115 106 114 113 95 91 75 128 50 68 2 127 


9 6l 4 
109 oy 153 119 94 108 33 101 108 122 88 102 56 92 117 
igo 168 99 60 S8 102 130 128 109 105 115 64 120 127 87 

81 120 115 99 73 105 122 109 95 85 99 60 48 123 
101 06 103 84 131 104 115 69 108 113 37 63 106 78 82 
72 158 89 106 94 105 99 80 94 114 109 91 


106 

7 106 123 74 107 92 98 87 74 ' 61 425 
$m Es] 106 Mi 118 150 145 94 110 95 107 81 112 73 
116 98 44 105 140 109 41 105 43 119 148 112 73 8 
2.1 THE ARITHMETIC MEAN AND THE STANDARD
DEVIATION 


When it is necessary to analyze a population of statistical meas- 
urements it often is desirable to calculate a single number which 
will be typical of the general level of magnitude of the measurements 
in the population. Logically, the first question is: What features 
should averages have in order to be typical of the data in some useful
sense? Therefore, the following properties of averages are suggested
as being either required of averages, or desirable:


(2.11) An average should be close, on the scale of measurement,
to the point (or interval) of greatest concentration of the measurements
in the population.

(2.12) It should be as centrally located among the numbers as is
compatible with property (2.11).

(2.13) An average should be simple to compute if that is achievable
under the circumstances.

(2.14) It should be tractable to mathematical operations so that
useful theoretical information can be derived by means of mathematical
methods.

(2.15) The average should be such that measures of the scatter
of the data about the average can be obtained and also have properties
(2.13) and (2.14).

A simple but crude average which sometimes is quite useful is the
midrange, MR. It is defined as that number (not necessarily one of
those being studied) which is halfway between the extreme numbers
in the set being summarized. For example, the extreme ACE scores
in Table 2.01 are 23 and 183. The difference is 160; hence

MR = 23 + 160/2 = 103, also = (23 + 183)/2,

because this is the number which is halfway between 23 and 183.
Among the desirable properties listed above, the midrange
is centrally located (it is midway between the
extremes), and it often is in the region of t ...
of the data. ... does not possess the
other properties listed. In addition, the midrange ...
to be a very reliable average because its si ...
the numbers, ...
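
The midrange calculation above can be expressed as a short routine. This is only a sketch in Python, which is an assumption of the illustration; the function name `midrange` is hypothetical and does not come from the text.

```python
def midrange(values):
    """Return the number halfway between the smallest and largest values."""
    lowest, highest = min(values), max(values)
    return lowest + (highest - lowest) / 2

# Only the two extreme ACE scores quoted in the text are needed.
print(midrange([23, 183]))  # 103.0
```

The routine looks at nothing but the two extremes, so every other measurement in the set has no influence on the result.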
An average called the arithmetic mean has been found to possess
all the properties (2.11) to (2.15) to a rather high degree for a broad
class of statistical populations. In addition, it is extremely useful in
the analysis of sampling data, as will be shown later. Hence the
arithmetic mean is a highly recommended average.

The arithmetic mean, $\mu$ (Greek letter mu), of N measurements
$X_1, \ldots, X_N$ is calculated by dividing the sum of the N measurements
by N. Symbolically,

(2.11)
$$\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N}, \quad \text{or for brevity,} \quad \mu = \frac{\sum_{i=1}^{N} X_i}{N}.$$


To illustrate, suppose that $X_1 = 2$, $X_2 = 5$, $X_3 = 1$, $X_4 = 3$, and
$X_5 = 4$; then the arithmetic mean is

$\mu = (2 + 5 + 1 + 3 + 4)/5 = 3$.*


Problem 2.11. Suppose that eight players are on the traveling squad of a
basketball team, and their weights are 152, 170, 165, 185, 201, 174, 191, and 210
pounds. What is the arithmetic mean of these weights?


The first question which may occur to the student is: What are the
$X_i$ in this instance? It is a well-known assumption in arithmetic
and algebra that the same sum is obtained for a given set of numbers
no matter what the order of addition; that is, 3 + 6 + 15 = 24 =
6 + 15 + 3 = 15 + 3 + 6, or any other possible order of addition.
Likewise, in the present problem, it makes no difference which weight
is symbolized as $X_1$, which as $X_2$, etc. It is convenient just to let
$X_1$ = the first weight listed, $X_2$ = the second weight on the list, and
so on. If that is done in this problem, $X_1 = 152$, $X_2 = 170$, $X_3 = 165$,


* Although the discussion in this chapter is chiefly devoted to methods and 
ideas appropriate to populations of data—which usually contain a large number 
of measurements—small groups of numbers will be used in examples and prob- 
lems for the purpose of facilitating and shortening discussions. Obviously, most 
of these problems and examples resemble samples far more than populations. 
However, the methods introduced will apply to populations and will not neces- 
sarily be correct or efficient for sampling studies, as will be noted later in the 


book. 
$X_4 = 185$, $X_5 = 201$, $X_6 = 174$, $X_7 = 191$, and $X_8 = 210$, all in
pounds. Therefore,

$$\sum_{i=1}^{8} X_i = X_1 + X_2 + \cdots + X_8 = 152 + 170 + \cdots + 210 = 1448$$

so that $\mu = 1448/8 = 181$ pounds, which is the arithmetic mean of
the weights.
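
The solution of problem 2.11 can be checked mechanically. The following is a minimal sketch in Python (an assumption of this illustration; the name `arithmetic_mean` is hypothetical) which applies formula 2.11 directly.

```python
def arithmetic_mean(values):
    """Divide the sum of the N measurements by N, as in formula 2.11."""
    return sum(values) / len(values)

# Weights (pounds) of the eight players of problem 2.11.
weights = [152, 170, 165, 185, 201, 174, 191, 210]

print(sum(weights))              # 1448
print(arithmetic_mean(weights))  # 181.0

# The order in which the weights are labeled makes no difference.
print(arithmetic_mean(sorted(weights)))  # 181.0
```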


Although the number 181 pounds gives a useful impression of the
general weight size of the players of problem 2.11, it is obvious that
the same mean weight could have been obtained for many other
groups of eight weights, some of which might be considered to be
quite different from those above. For example, each of the following
sets of eight weights (in pounds) has $\mu = 181$:

Set 1. 185, 180, 181, 184, 182, 179, 177, and 180.
Set 2. 190, 190, 190, 182, 184, 183, 190, and 139.
Set 3. 172, 180, 165, 160, 175, 168, 180, and 248.

In Set 1, the extreme weights are but 8 pounds apart; three weights
are higher than the mean, four are lower than the mean, and one is
the same as the mean weight. This set differs from that of problem
... ore uniform. In Set 2 ... manner in which it occu ... metic mean.

... the mean. ... an $X$ is smaller than the mean, the $x$ is ... than
the mean, the corresponding deviation, $x$, is positive.
In the first numerical example of this chapter, $x_1 = -1$, $x_2 = +2$,
$x_3 = -2$, $x_4 = 0$, and $x_5 = +1$. For problem 2.11, $x_1 = -29$,
$x_2 = -11$, $x_3 = -16$, $x_4 = +4$, $x_5 = +20$, $x_6 = -7$, $x_7 = +10$, and
$x_8 = +29$. It is observed that, at least in these instances, $\sum x = 0$.
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The truth of the general theorem that the sum of the algebraic values 
of the 2; always is zero is established as follows. By definition and 
simple algebra, Sx = X(X — p) = XX — X(g); but XX = Ny, and 
X(u) also = Ny because this symbol requires that we add N terms 
obtained by letting i have values from 1 to N, inclusive. The y stays 
constant for each 1; therefore, X(y) = Np. Since Xv = Na — Ny, it 
is always equal to zero, as was to be shown. 
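The theorem can also be checked numerically. The short Python sketch below does so for the eight weights of problem 2.11. The function name `deviations` is merely illustrative, and the third weight (165 pounds) is an assumption: it is not legible above and has been inferred from the stated total of 1448 pounds.

```python
# Weights of problem 2.11 in pounds. The third value (165) is an assumption
# inferred from the stated total of 1448; the others appear in the text.
weights = [152, 170, 165, 185, 201, 174, 191, 210]


def deviations(values):
    """Return the arithmetic mean and the deviations x_i = X_i - mean."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]


mu, x = deviations(weights)
print("mean:", mu)
print("deviations:", x)
print("algebraic sum of deviations:", sum(x))
```

Any other set of numbers may be substituted for `weights`; apart from small rounding effects in floating-point arithmetic, the final sum should remain zero.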

As a consequence of the truth of the above theorem, a measure
of the variation about the arithmetic mean cannot be based upon the
algebraic sum of the $x_i$. Therefore, one of two actions should be
taken: (a) Ignore the signs of the $x_i$ and obtain their mean there-
after. Or (b), find some other relatively simple function of the $x_i$
which has more of the desirable properties (2.11) to (2.15) than are
obtained by method a. The latter procedure has proved to be the
more successful and therefore will be considered first. As a matter
of fact, it involves a function of the squared deviations, $x^2$.

The quantity

$$\sigma = \sqrt{\frac{\sum x^2}{N}}, \tag{2.12}$$

where $\sigma$ is the lowercase Greek letter sigma, has been found by
statisticians to be a good measure of the variability of a set of
numerical measurements about their arithmetic mean. Just why it should
be so useful cannot be shown to the student at this time, but it does
have more of the desirable properties of measures of variation than any
other such measure which has yet been devised. The quantity defined by
formula 2.12 is called the standard deviation of the $X_i$ about their
mean, $\mu$. It would be zero if all the $X_i$ were equal; the more
dispersed they are about the mean, the larger the standard deviation
tends to be. For example, consider the weights of problem 2.11 and
of Set 1. The former obviously are more dispersed and generally
more variable than the latter. The two standard deviations are 18.1
and 2.4, respectively, which certainly is a concise way to point out
that, although the mean weights of the two squads are the same,
their dispersions about that mean are far from the same.
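As a rough check on the two standard deviations just quoted, formula 2.12 can be applied directly. The sketch below is only illustrative: the name `std_dev` is hypothetical, and the third weight of problem 2.11 (165 pounds) is assumed from the stated total of 1448 rather than read from the text.

```python
import math


def std_dev(values):
    """Population standard deviation by formula 2.12: sqrt(sum(x^2) / N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((v - mu) ** 2 for v in values) / n)


# Problem 2.11 weights; the value 165 is an inferred assumption.
problem_211 = [152, 170, 165, 185, 201, 174, 191, 210]
set_1 = [185, 180, 181, 184, 182, 179, 177, 180]

print(round(std_dev(problem_211), 1))
print(round(std_dev(set_1), 1))
```

Both groups have the same mean of 181 pounds, so the difference between the two results reflects dispersion alone.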

The square of the standard deviation, $\sigma^2$, is called the variance
of the $X_i$ about $\mu$. There are some relatively advanced statistical
procedures in which it is preferable to work with the variance in-
stead of the standard deviation, but the latter will be used most of
the time in this book.

From the definition of $\sigma$ contained in formula 2.12 it appears that
each $x_i$ must be calculated and squared, but such is not the case. If
the x's are difficult to compute, the following results are useful. In
view of the fact that $x^2 = (X - \mu)^2 = (X^2 - 2\mu X + \mu^2)$, it should
be clear that
$\sum x^2 = \sum (X^2 - 2\mu X + \mu^2) = \sum X^2 - 2\mu \sum X + \sum \mu^2$.
But $\sum \mu^2 = N\mu^2$, as explained earlier, and $\mu = \sum X / N$;
therefore, $-2\mu \sum X = -2(\sum X)^2/N$ and $\sum \mu^2 = +1(\sum X)^2/N$.
It follows that $\sum x^2 = \sum X^2 - (\sum X)^2/N$. If this substitution
is made into formula 2.12, the following alternative method for computing
the standard deviation, $\sigma$, is obtained:

$$\sigma = \sqrt{\frac{\sum X^2 - (\sum X)^2/N}{N}}. \tag{2.13}$$

For a numerical example considered earlier in this chapter, formula
2.13 becomes

$$\sigma = \sqrt{\frac{55 - (15)^2/5}{5}} = \sqrt{2}.$$

The variance is $\sigma^2 = 2$.
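Formula 2.13 needs only N, the sum of the X's, and the sum of their squares. A minimal Python sketch follows; the function name `std_from_sums` is hypothetical. The first call uses the sums of the numerical example above, and the second uses the sums quoted for Table 2.01 in problem 11 at the end of this section.

```python
import math


def std_from_sums(sum_x, sum_x2, n):
    """Standard deviation by formula 2.13, from sum(X), sum(X^2), and N."""
    sum_of_squared_deviations = sum_x2 - sum_x ** 2 / n
    return math.sqrt(sum_of_squared_deviations / n)


# Numerical example of the text: sum X = 15, sum X^2 = 55, N = 5.
print(std_from_sums(15, 55, 5))

# Sums given for Table 2.01 in problem 11 of this section.
print(round(std_from_sums(123_445, 12_693_988, 1290), 1))
```

The individual deviations never have to be formed, which is the practical advantage of formula 2.13 over formula 2.12.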


To further illustrate the uses to which $\sigma^2$ and $\sigma$ can be put,
consider again the ACE scores of Table 2.01. The arithmetic mean is
95.7. The standard deviation is calculated to be 26.1 (see problem
11 at the end of this section), with the extreme scores being 23 and
183. It is noted that 95.7 is a bit less than midway between the
lowest and highest scores, but in general it is quite centrally located
in that respect. To obtain a clearer picture of the dispersion of the
scores between the extreme scores, and about the mean, the standard
deviation will be found to be very useful in subsequent discussions.
It is entirely possible for different sets of data to have essentially
the same extremes but very different distributions with respect to the
mean. The $\sigma^2$ and $\sigma$ will help to describe these differences. This
use of the variance and standard deviation is illustrated, in part, by
the following discussion.

As the student can verify, approximately two-thirds of the scores are
included among the numbers from 70 to 121, inclusive. This fact can be
put in the following brief form: $\mu \pm 1\sigma$ includes 67.1 per cent
of the scores. In a strictly normal population the corresponding
percentage is 68.3. Such information sometimes is convenient in the
summarization of sets of numbers like Table 2.01. Likewise the interval
$\mu \pm 2\sigma$ (which includes scores from 44 to 147 inclusive) contains
95.2 per cent of the 1290 scores in the table. If
this population were perfectly normal, that percentage would be
95.4. Also the interval $\mu \pm 3\sigma$ includes 99.8 per cent of the ACE
scores, whereas a normal population would have 99.7 per cent of its
members in that interval. The reader can determine how closely the
population of Table 2.01 conforms to the normal requirement that
38.3 per cent of the measurements shall lie not more than one-half a
standard deviation above or below the mean, $\mu$.

More discussion of normality and of population distributions will
come later; the point of the above discussion is that knowledge about
the mean and the standard deviation is useful in the study of one of
the most important types of populations of data.
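The normal percentages quoted above (38.3, 68.3, 95.4, and 99.7) can be reproduced with the error function in Python's standard `math` module. The sketch relies on the identity that the fraction of a normal population within k standard deviations of the mean is erf(k/√2); that identity is a standard result brought in for illustration and is not derived in this chapter. The function name is hypothetical.

```python
import math


def normal_pct_within(k):
    """Per cent of a normal population lying within k standard deviations of the mean."""
    return 100.0 * math.erf(k / math.sqrt(2.0))


for k in (0.5, 1, 2, 3):
    print(k, round(normal_pct_within(k), 1))
```

These theoretical figures may be set beside the observed 67.1, 95.2, and 99.8 per cent for the ACE scores.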


PROBLEMS 


1. Calculate the arithmetic mean and the standard deviation of the following
numbers: 2, 3, 9, 7, 5, 4, 10, 6, 3, 1, and 5.

2. Make up three sets of numbers, each of which has $\mu = 7$.

3. Compute the $x_i$ for problem 1 and verify that $\sum x_i = 0$.

4. Given the numbers 0, 8, 0, 1, 1, 1, 10, 2, 1, 1, 2, 3, 0, and 1, compute $\mu$.
Does $\mu$ seem to you to be a good average for these numbers? Why? Ans. 2.21.

5. Suppose that the mathematics grades for a certain class were 54, 95, 68, 71,
87, 75, 84, 63, 76, 81, 70, 90, 73, 77, and 61. Calculate $\mu$ and $\sigma$, using the indi-
vidual $x_i$ first and then using formula 2.13 for $\sigma$.

6. The following percentages of protein in samples of pasture grasses were
made available by Dr. George Wise, formerly of the Department of Dairy
Husbandry at Kansas State College. Compute $\mu$, $\sigma^2$, and $\sigma$, given that
$\sum X = 1423.33$ and that $\sum X^2 = 21,924.2025$.


22.50  15.26  15.63  13.52  10.82   9.22  12.25  21.07  18.83
16.52  15.54  16.17  10.03  15.71  12.79  13.11  14.85  11.45
11.97  11.07  15.26  9.831  12.30  13.04  14.19  12.94  14.36
15.02  11.15  12.08  15.41   8.56   9.00  13.07  12.51  18.91
14.98  14.54  13.68  11.78  14.22  13.07  14.27  10.27  11.01
11.66   8.19   6.75  14.48  15.98  14.36  15.24  14.48  14.05
15.02  15.41  10.02   9.96  12.34  16.20  10.19  14.20  12.56
 9.74  14.34  13.07  12.33  11.57  15.48  11.74   9.39   6.47
25.00  23.23  16.75  10.62  16.30  17.20  20.68  13.76  11.88
10.57   8.37  26.20  26.74  22.02  20.00  20.36  15.83  14.11
22.42  19.47  17.98  20.32  14.83  13.03  10.31   8.95  11.57
16.03

Ans. 14.23, 16.66, 4.08.


7. The following are bearings taken with a radio direction finder on a signal
sent repeatedly from a fixed location. Compute their arithmetic mean and
standard deviation as though these data constituted a population.

X: 9, 8, 6, 4, -10, 6, 7, 10, 8, 9, 7, 6, 8, 8, 10, 10, 8, 8, 10, 9, 10, 7, 7, 3, and 8
(degrees from north).

8. The following data are like those of problem 7, but taken on a different
direction finder. Obtain the variance and the $\sigma$ for these data given that
$\sum X^2 = 116,830$, $\sum X = 1708$.

X: 66, 68, 69, 68, 71, 70, 66, 70, 68, 67, 68, 73, 68, 65, 72, 73, 68, 67, 69, 65, 64,
66, 67, 70, and 70 (degrees from north). Ans. 5.58, 2.36.

9. Work problem 8, after subtracting 60 degrees from each bearing. How
much were $\mu$ and $\sigma$ changed? How much were the $x_i$ changed? What if only
50 degrees had been subtracted?

10. Use one-sixth of the range in problem 6 as an estimate of the standard
deviation, and compare this estimate with the true standard deviation.
Ans. 3.38.

11. Given that for Table 2.01 $\sum X = 123,445$, and $\sum X^2 = 12,693,988$. Calculate
the arithmetic mean and the variance about the mean, $\mu$.

12. Given the following six yields of Ponca wheat at Manhattan, Kansas,
compute their mean after first subtracting 27 from each number. (Data
provided by Department of Agronomy, Kansas State College.) Yield
(bushels/acre): 27.2, 40.9, 46.0, 38.1, 43.8, 46.3.
Ans. 13.2 bushels per acre; therefore, true mean = 40.2.

13. The test weights corresponding to the bushel yields of problem 12 were
as follows (data from same source): 59.3, 60.7, 60.6, 60.2, 61.9, 58.1. Calculate
the midrange, the arithmetic mean, and the variance.

14. In problems 12 and 13, which of the types of measurement, yield or test
weight, gives the more consistent results according to this evidence? Give
reasons.

15. Write down every fiftieth score in Table 2.01, starting in the upper left-
hand corner of the table and working from left to right. Compute the arith-
metic mean of the sample thus obtained and compute the percentage error
relative to the true mean, 95.7.


2.2 THE AVERAGE (OR MEAN) DEVIATION 

A measure of the variation about the arithmetic mean based
on the numerical values (signs are ignored) of the $x_i$ was mentioned
in the preceding section. ... mean, we might well decide ... average (or
mean) deviation. It is the ... a positive number regardless ... the weights
of problem ... $\mu$, were found to be: ... and 29 pounds. On the average
(that is, considering the arithmetic mean as the average) the weights of those
basketball players differed from the mean weight of the group by

(29 + 16 + 11 + 7 + 4 + 10 + 20 + 29)/8 = 126/8 = 15.75 pounds.

Then the average deviation for these weights is 15.75 pounds.

Symbolically the average deviation is defined by

$$AD = \frac{\sum |X - \mu|}{N} = \frac{\sum |x|}{N}, \tag{2.21}$$

where $|x|$ = a deviation from $\mu$ taken as a positive number whether
the corresponding X was larger than $\mu$ or smaller than $\mu$.

For the weights just used for illustration, $\sigma = 18.11$ pounds. The
standard deviation is larger than the mean deviation, as is usual.
The standard deviation is much more widely used than the mean
deviation partly because it has many useful applications in sampling
studies, which after all is by far the more fruitful and interesting
field of statistical analysis.
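The comparison between the two measures can be sketched in a few lines of Python, starting from the eight absolute deviations listed in the computation above. The function names are illustrative only.

```python
import math

# Absolute deviations (pounds) of the eight weights from their mean of 181,
# exactly as they appear in the computation of the average deviation above.
abs_deviations = [29, 16, 11, 7, 4, 10, 20, 29]


def average_deviation(abs_devs):
    """Formula 2.21: the arithmetic mean of the absolute deviations."""
    return sum(abs_devs) / len(abs_devs)


def standard_deviation(abs_devs):
    """Formula 2.12; squaring removes the signs, so absolute values suffice."""
    return math.sqrt(sum(d ** 2 for d in abs_devs) / len(abs_devs))


print(average_deviation(abs_deviations))
print(round(standard_deviation(abs_deviations), 2))
```

Squaring gives the larger deviations more influence, which is consistent with the remark above that the standard deviation usually exceeds the mean deviation.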


2.3 OTHER AVERAGES 


Another average which is simple to compute and of rather wide 
application for descriptive purposes is the median, symbolized as md. 
The median of a set of numerical measurements is intended to be a 
number such that one-half the numbers are less than or equal to the 
median, and the other half are greater than or equal to the median; 
that is, the median is exactly in the middle of the set of numbers in 
order of size, if such is possible. 

It is necessary—either actually or effectively—to list the numbers 
in order of size before the median can be determined accurately. 
Such an ordered group of numbers is called an ordered array. Thus 
the numbers 1, 5, 2, 3, 0, 1, 8, and 10 do not form an ordered array, 
whereas these same numbers listed as 0, 1, 1, 2, 3, 5, 8, and 10 do 
constitute an ordered array. 

With the definition of an ordered array established, it is con-
venient to define the median of N numbers: $X_1, X_2, \ldots, X_N$ as the
[(N + 1)/2]th number in the array, starting with the lowest num-
ber in the array. It is noted that only if N is odd does such an ordinal
number exist; but it is sufficient herein to define an "ordinal" number
like 4.5 to be a number which is just midway in magnitude between
the fourth and fifth numbers in the array. For example, for the
array used above, N = 8, so that (N + 1)/2 = 4.5. The
median is md = 2.5, a number midway in size between the 2 (which
is the fourth number in the array) and the 3 (which is the fifth num-
ber in the array).

It is seen, with a little study, that the median is an average
which will be nearer the region of concentration of the numerical
measurements in a population than the arithmetic mean if there are
a few "stray" numbers at one end of the scale of measurement. For
example, consider the following simulated annual salaries (in thou-
sands of dollars) of college instructors in one department: 3.1, 3.5,
3.5, 3.6, 3.6, 3.6, 3.8, 3.8, 3.8, 3.9, 3.9, 4.0, 4.0, 4.0, 4.4, 4.8, 5.0, 6.5,
8.4, 8.7, and 8.8. For these data N = 21, $\sum X = 98.7$, $\mu = 4.7$, and
md = 3.9. It is seen that eleven of that staff are receiving within
$300 of the median salary whereas only three are that close to the
arithmetic mean. The arithmetic mean exaggerates the typical salary
in a very real sense for all but the fortunate six at the top. In situa-
tions of this sort (which will be described later as skewed distribu-
tions when more data are involved) the median is a better average
than the arithmetic mean when its purpose is to describe the typical
measurement in the population.
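The [(N + 1)/2]th-number rule and the salary comparison can be sketched in Python as follows. The function name `median` is illustrative, and the salaries are entered in hundreds of dollars (an assumption made only to keep the arithmetic in whole numbers).

```python
def median(values):
    """Median by the [(N + 1)/2]th-number rule applied to the ordered array."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2      # ordinal position, counted from 1
    lower = ordered[int(position) - 1]
    if position == int(position):          # N odd: the position exists exactly
        return lower
    upper = ordered[int(position)]         # N even: go midway between neighbors
    return (lower + upper) / 2


# Salaries of the text, expressed in hundreds of dollars.
salaries = [31, 35, 35, 36, 36, 36, 38, 38, 38, 39, 39, 40, 40, 40,
            44, 48, 50, 65, 84, 87, 88]

md = median(salaries)
mu = sum(salaries) / len(salaries)
near_median = sum(1 for s in salaries if abs(s - md) <= 3)   # within $300
near_mean = sum(1 for s in salaries if abs(s - mu) <= 3)

print(median([1, 5, 2, 3, 0, 1, 8, 10]))
print(md, mu, near_median, near_mean)
```

The counts of salaries near each average show why the median is the more typical figure when a few large values lie at one end.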

If a fairly large group of numbers is to be summarized ...
median is a desirable average ...
reducing the necessary labor. For example, the MR for Table 2.01
is 103; hence we can hope that the median has about the same size.
On this assumption we can ...
to 100 and thereafter determine ...
array without excessive labor. It thus is found that md = 97.

There are three other averages which will be considered, and
which will find occasional application ...
One is the mode (MO). The mode is defined to be that measure-
ment which occurs in a given set of data with the greatest fre-
quency, if such a number exists. For example, the mode of the set
5, 8, 9, 10, 10, 10, 11, 13, and 15 is MO = 10. If some number in a
group of data occurs with decisively the greatest frequency, the
mode may well be the average to use; but such is rather rarely the
case.

The geometric mean of $X_1, \ldots, X_N$ is defined as the Nth root of
the product of these numbers. Symbolically,

$$GM = \sqrt[N]{X_1 X_2 \cdots X_N}. \tag{2.31}$$

Under most circumstances it is easier to compute the geometric
mean from the relation

$$\log (GM) = \frac{1}{N} \sum (\log X). \tag{2.32}$$

As an illustration consider the numbers 2, 5, 8, and 15. By definition
GM = the fourth root of the product (2)(5)(8)(15); but, using
logarithms to the base 10, one has log GM = (log 2 + ... + log
15)/4 = 0.7698. The antilog of 0.7698 is approximately 5.9, which is
the geometric mean of the given set of numbers. The geometric
mean is useful in the calculation of certain index numbers, in studies
of biological growth, and, in general, whenever the statistical array
indicates that a geometrical series is involved. Obviously the geo-
metric mean is not used if any X = 0 or if the product under the
radical is negative.
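Relation 2.32 translates directly into a short computation. The Python sketch below uses base-10 logarithms as in the illustration; the function name is hypothetical, and the sketch makes the stricter assumption that every X must be positive, which is slightly more than the text requires.

```python
import math


def geometric_mean(values):
    """Geometric mean through relation 2.32 with base-10 logarithms."""
    if any(v <= 0 for v in values):
        # Assumption of this sketch: logarithms require strictly positive X.
        raise ValueError("all values must be positive")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log                  # the antilog of the mean logarithm


numbers = [2, 5, 8, 15]
mean_log = sum(math.log10(v) for v in numbers) / len(numbers)
print(round(mean_log, 4))
print(round(geometric_mean(numbers), 1))
```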

The last average to be considered herein is called the harmonic 
mean. It also is used only in specialized circumstances, but the pos- 
session of some information about it will help to round out the reader's 
knowledge regarding statistical averages. 

The harmonic mean is defined as the reciprocal of the arithmetic
mean of the reciprocals of a given group of numbers; or

$$HM = \frac{1}{\sum (1/X_i)/N} = \frac{N}{\sum (1/X_i)}. \tag{2.33}$$

For example, if the X's are 3, 8, 2, 5, and 2 the denominator is
$\sum (1/X_i)$ = 1/3 + 1/8 + 1/2 + 1/5 + 1/2 = 1.6583, approximately;
hence HM = 5/1.6583 = 3.02. One use of the harmonic mean comes
when rates of some sort are involved. Consider this problem. A man 
drives the first 50 miles of a trip at 50 mph, and the second 50 miles 
at a rate of 60 mph. What is his average rate for the trip? By the 
usual definition, the average rate is obtained by dividing the total 
distance traveled by the total time taken to go that distance. The 
distance traveled was 100 miles. The first 50 miles took one hour, 
and the second 50 miles took five-sixths of an hour; hence the total 
time was 11/6 hours. Therefore, the average rate of speed was 
100/(11/6) = 600/11 = 54 and 6/11 mph. The harmonic mean of 
50 and 60 also is 54 and 6/11 mph; that is, the required average rate 
is just the harmonic mean of the two rates in this instance. It is 
noted that the distance traveled was the same for each rate of speed. 

Now suppose that a person drives for one hour at 50 mph and then
the second hour at 60 mph. What is the average rate of speed during
this trip? The total distance traveled is 110 miles, and it took two
hours; therefore the average rate is 55 mph. But that is just the
arithmetic mean of the two rates. It is seen that when time (hours)
was fixed in the problem, the appropriate average was the arith-
metic mean; when the distance (miles) was fixed and time was
variable according to the speed of travel, the appropriate average
was the harmonic mean.
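The two driving problems can be worked side by side in a short Python sketch. The function names are illustrative; the figures are those of the text.

```python
def harmonic_mean(values):
    """Formula 2.33: N divided by the sum of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


rates = [50, 60]                                   # miles per hour

# Equal distances: 50 miles at each rate.
total_time = sum(50 / r for r in rates)            # hours
equal_distance_rate = 100 / total_time

# Equal times: one hour at each rate.
total_distance = sum(r * 1 for r in rates)         # miles
equal_time_rate = total_distance / 2

print(round(harmonic_mean([3, 8, 2, 5, 2]), 2))
print(equal_distance_rate, harmonic_mean(rates))
print(equal_time_rate, arithmetic_mean(rates))
```

Computing the average rate from total distance and total time in each case, and only then comparing it with the two means, mirrors the reasoning of the text rather than assuming the answer.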

In general, the proper average to use in any particular situation 
either will be determined at the outset by previous practices in the 
particular sort of work, or it can be determined by a bit of prelim- 
inary study of the matter. Hence no attempt will be made to lay 
down rules. However, it should be apparent to the reader that when 
a body of data is to be summarized statistically there may be several 
possible choices of averages and also of measures of variation. We 
should be fully conscious of this fact when we compute averages, or 
when we interpret those averages computed by someone else. 


PROBLEMS 


1. The following numbers are ... before World War II: 1300, ...,
1400, 1450, 1750, 1500, 1600, and 1400. Calculate the arithmetic mean
and the median, and state which average you consider the more typical
of these salaries.

2. Suppose, in problem 1, that the following raises in salary were given: $2200
to $3000, $3000 to $3500, $4500 to $5000, $1800 to $2200, $1750 to $2200; and
all others are given a $100 raise. The salaries of problem 1 add to $41,850, and
the total of the raises is $4650, approximately 11 per cent of $41,850. Is it then
fair to state that those teachers received an 11 per cent increase in salary, on
the average?

... Compute the geometric mean of the salaries in problem 1 to the nearest
dollar.

... for peaches at ...

8. Do as in problem 7 except to consider ... bought at each ...
Ans. $3 per bushel.

9. Suppose that 5 bushels of the $3 peaches, 10 of the $2 peaches, and 15
bushels of the $4 peaches were purchased. What was the average price per
bushel?


2.4 FREQUENCY DISTRIBUTIONS


To introduce the method of constructing frequency distributions, 
and to show what sort of information can be derived from them, 
reference is made to the numbers of Table 2.01. It is possible by
means of problem 11, section 2.1, to calculate that $\mu = 95.7$ and
$\sigma = 26.1$. These statistical constants furnish some useful information
about the population of scores, but they fail quite badly to sum- 
marize them adequately. For example, a person who made a score of 
120 could not be told accurately how he compared with the others 
taking this same test, and that information usually is important in 
the use of such tests. One way to obtain this sort of information is to 
construct frequency distributions and graphs which display the out- 
standing features of the population. 

Two types of summaries of distributions will be considered both
numerically and graphically: a frequency and a relative cumulative
frequency (or r.c.f.) distribution. Both distributions will be de-
scribed by means of a grouping of the individual scores into con-
venient score classes, even though such frequency distributions could
be made without grouping the members of the population into classes.
The scores then lose their individual identities and become members
of ten to twenty groups. The data become more manageable, and
little accuracy is lost in the process. To illustrate, consider Table
2.01 again. The extreme scores have been noted previously to be 23
and 183 so that the range is 160. If the range is divided by 10 a
quotient of 16 is obtained. Classes of that length would give the
minimum acceptable number of classes; hence for convenience in
tallying (as shown below) the class interval will be taken as 15.
Table 2.41 was constructed starting with the lowest class at 10 purely
because it was convenient and the lowest class included the lowest
ACE score in Table 2.01. The actual tallying of the data is shown,
as is the summarization of the tallies into a frequency (f) for each
class. In Table 2.42 a more concise form of the frequency distribu-
tion is shown along with the r.c.f. distribution. The latter distribu-
tion gives the decimal fraction of the ACE scores which were less
than or equal to the upper limit of the corresponding score class at
the left. For example, practically one-third (actually .332) of the
scores were at or below a score of 84, according to Table 2.42.

TABLE 2.41

FREQUENCY DISTRIBUTION TABLE FOR THE DATA OF TABLE 2.01, SHOWING
TALLYING

[Tally marks omitted; only the class frequencies are shown.]

Score Class      f
175-189          3
160-174          4
145-159         33
130-144         81
115-129        186
100-114        278
 85- 99        277
 70- 84        209
 55- 69        132
 40- 54         69
 25- 39         15
 10- 24          3
Total         1290
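The cumulative and relative cumulative frequencies described above follow mechanically from the class frequencies of Table 2.41. A minimal Python sketch, with illustrative variable names, accumulates the frequencies from the lowest class upward:

```python
from itertools import accumulate

# Score classes and frequencies of Table 2.41, listed from lowest to highest.
classes = ["10-24", "25-39", "40-54", "55-69", "70-84", "85-99",
           "100-114", "115-129", "130-144", "145-159", "160-174", "175-189"]
frequencies = [3, 15, 69, 132, 209, 277, 278, 186, 81, 33, 4, 3]

cumulative = list(accumulate(frequencies))        # c.f. up to each class
total = cumulative[-1]
relative = [c / total for c in cumulative]        # r.c.f. as a decimal fraction

for name, f, cf, rcf in zip(classes, frequencies, cumulative, relative):
    print(f"{name:>8} {f:5d} {cf:6d} {rcf:6.3f}")
```

The row for the class 70-84 gives the fraction of scores at or below 84 that is quoted in the text.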


To assist in the drawing of conclusions from Tables 2.41 and 2.42,
the information they contain is presented in graphic form in Figure
2.41. Conclusions such as the following can be drawn from that figure
and the tables pertaining to Table 2.01:

(2.41) Only about 7 per cent of the students made a score less
than 55. This information can be read directly from Table 2.42 and
verified approximately from the r.c.f. curve of Figure 2.41. Also,
approximately 50 per cent of the students made a score of 98 or more,
a fact which corresponds closely with the fact that the exact median
is 97. Information of this sort can be obtained from Figure 2.41 by
reading horizontally from r.c.f. = .50 over to the r.c.f. curve and then
vertically downward to the scale of ACE scores.

TABLE 2.42

THE RELATIVE CUMULATIVE FREQUENCY DISTRIBUTION OF THE DATA
IN TABLE 2.01

Score Class      f     c.f.   r.c.f.
175-189          3    1290    1.000
160-174          4    1287     .998
145-159         33    1283     .995
130-144         81    1250     .969
115-129        186    1169     .906
100-114        278     983     .762
 85- 99        277     705     .546
 70- 84        209     428     .332
 55- 69        132     219     .170
 40- 54         69      87     .067
 25- 39         15      18     .014
 10- 24          3       3     .002
Total         1290


(2.42) With specific reference to the student mentioned above
who made a score of 120, it is learned that about 82 per cent of the
students did no better; or only about 18 per cent beat him. Hence
he should be considered to be quite high in aptitude and intelligence
relative to those who took that same test, and would be expected to
do rather well in college.

[Graph: frequency (0 to 300) and r.c.f. (0 to 1.00) plotted against test
score (10 to 175).]

Figure 2.41. Frequency distributions of the ACE test scores listed in Table
2.01. Data furnished by the Counseling Bureau of Kansas State College.

Other similar information can be read from the above tables and
graphs. Moreover, it will be shown later that the r.c.f. graph can be
employed to determine quartile, decile, and even percentile limits if
the graph is drawn with sufficient care. In addition it will be shown
in the next section that in some situations good approximations to $\mu$
and $\sigma$ can be computed from the frequency distribution table. In
brief, the frequency and r.c.f. tables and graphs serve as visual guides
and as sources of good approximations. If more precise information
is needed or desired (and it seldom is), one can analyze the individual
observations.

The graph of the r.c.f. distribution sometimes is called an ogive.
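Reading an ogive amounts to interpolating along the r.c.f. curve, and that can be imitated numerically. The Python sketch below rests on two assumptions that the text does not state: each class of whole-number scores is taken to run from half a unit below its lower limit to half a unit above its upper limit (so that 115-129 covers 114.5 to 129.5), and the curve is taken to be a straight line within each class. The function names are hypothetical.

```python
WIDTH = 15
# Assumed class boundaries: 9.5, 24.5, ..., 189.5.
boundaries = [9.5 + WIDTH * k for k in range(13)]
# Cumulative frequencies of Table 2.42 at those boundaries, starting from zero.
cumulative = [0, 3, 18, 87, 219, 428, 705, 983, 1169, 1250, 1283, 1287, 1290]
TOTAL = cumulative[-1]


def rcf_at(score):
    """Approximate fraction of scores at or below `score` (ogive read upward)."""
    if score <= boundaries[0]:
        return 0.0
    for i in range(1, len(boundaries)):
        if score <= boundaries[i]:
            fraction = (score - boundaries[i - 1]) / WIDTH
            count = cumulative[i - 1] + fraction * (cumulative[i] - cumulative[i - 1])
            return count / TOTAL
    return 1.0


def score_at(rcf):
    """Approximate score below which the fraction `rcf` lies (ogive read across)."""
    target = rcf * TOTAL
    for i in range(1, len(cumulative)):
        if target <= cumulative[i]:
            fraction = (target - cumulative[i - 1]) / (cumulative[i] - cumulative[i - 1])
            return boundaries[i - 1] + fraction * WIDTH
    return boundaries[-1]


print(round(rcf_at(120.5), 2))   # fraction scoring 120 or less
print(round(rcf_at(54.5), 3))    # fraction scoring less than 55
print(round(score_at(0.50), 1))  # interpolated estimate of the median
```

Such interpolated readings are approximations only; like a value read from the graph, the median estimate need not agree exactly with the value found from the individual scores.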


PROBLEMS 


1. Following are some numbers of house flies counted on individual dairy
cows which had been sprayed with a 3 per cent solution of Thanite in 40 oil:
35, 37, 41, 103, 174, 7, 11, 32, 23, 7, 6, 3, 14, 23, 28, 36, 25, 27, 3, 3, 13, 14, 14, 6,
15, 9, 11, 21, 9, 12, 3, 15, 19, 29, 26, 1, 8, 4, 9, 7, 12, 5, 1, 3, 5, 60, 11, 6, 4, 7, 22,
28, 5, 3, 6, 15, 1, 2, 11, 4, 27, 1, 0, 0, 19, 6, 2, 3, 4, 13, 5, 12, 11, 14, 45, 4, 38, 5, 17,
27, 39, 33, 13, 9, 8, 33, 19, 6, 12, 32, 11, 35, 18, 11, 25, 23, 45, 30, 4, 4, 15, 15, 16,
11, 16, 18, 32, 49, 129, 7, 21, 26, 76, 40, 5, 7, 5, 7, 4, 62, 91, 133, 61, 59, 20, 26, 10,
12, 6, 7, 8, 8, 2, 24, 21, 51, 110, 11, 6, 4, 4, 5, 5, 5, 13, 3, 6, and 7. Construct
frequency and relative cumulative frequency distributions for these data, esti-
mate the median, and decide whether $\mu$ or md is the preferable average for
these data. Is this a skewed distribution? Use class intervals: 0-8, 9-17, etc.

2. Graph the distributions asked for in problem 1.

3. Compute the arithmetic mean and the median of the counts in problem 1.
Compare them, and draw appropriate conclusions. $\sum X = 3031$, $\sum X^2 = 163,439$.

4. Estimate from the ogive (r.c.f. curve) for problem 1 what percentage of
the fly counts lie between 5 and 25, inclusive. Check your calculation by
actually counting in problem 1. Ans. Graph, 46; by count, 45.

5. Use the following frequency distribution table of 8-week weights (in grams)
of male White Rock chickens raised at the Kansas State College Poultry Farm
and the accompanying graphs to: (a) estimate $\mu$, (b) determine what percentage
of the weights exceeded 800 grams, (c) determine the range covered by the
"middle" 50 per cent of the weights, that is, excluding the upper and lower
25 per cent of the weights.


                                           Relative
                            Cumulative    Cumulative
               Frequency,   Frequency,    Frequency,
Weight Class       f           c.f.         r.c.f.
1025-1074           1          1217         1.000
 975-1024          16          1216          .999
 925- 974          29          1200          .986
 875- 924          66          1171          .962
 825- 874         148          1105          .908
 775- 824         169           957          .786
 725- 774         265           788          .647
 675- 724         210           523          .430
 625- 674         155           313          .257
 575- 624          85           158          .130
 525- 574          51            73          .060
 475- 524          17            22          .018
 425- 474           5             5          .004
Total            1217

[Graph: frequency (0 to 250) and relative cumulative frequency (0 to 1.00)
plotted against weight (425 to 1025 grams).]

... and the relative cumulative frequency distribution ...

... 20, 34, 87, 28, 14, 30, 24, 53, 57, 93, 36,
... 30, 32, 58, 4, 20, 30, 33, 20, 22, 28, 11, 23, 46, 41,
41, 44, 23, 18, 41, 48, 81, 80, 70, 5, 2, 13, 21, 21, 171, 1, 7, 10, 5, 2, 17, 9, 35, 6, 8,
10, 23, 19, 3, 25, 16, 131, 19, 19, 24, 18, 10 ...
... 5, 13, 10, 12, 10, 6, 6, 14, 24, 61, 25, 26, 21, 12,
15, 18, 19, 26, 21, 11, 0, 0, 8, 34, 66, 32, 7, 8, 23, 20, 24, 62, 8, 15, 19, 33, 20, 51,
11, 20, 13, 27, 15, 10, 16, 16, 5, 4, 24, 30, 37, 26, 17, 14, 15, 6, 3, 22, 53, 54, 74,
1, 10, 12, 22, 49, 52, 31, 7, 20, 23, 28, 56, 2, 6, 6, 30, 30, 38, 1, 2, 4, 21, 51, 14, 5,
17, 21, 28, 9, and 7.
... 18, 20, 8, 13, 17, 27, 23, 23, 10, 2 ...

7. Construct the relative cumulative frequency distribution for the data of
the preceding problem, and read from it the value of the median. Check that
result with the value obtained from an ordered array of those data.

8. Use the graph of the relative cumulative frequency distribution for the
counts in problem 6 to determine what percentage is less than or equal to ten
flies per cow. Ans. 29 per cent.

9. Within what extremes did the lowest one-fourth of the counts listed in
problem 6 lie? The middle one-fourth? The highest one-fourth?

10. Take any newspaper with at least one hundred bond or stock quotations
and make frequency and relative cumulative frequency distributions of those
prices.

2.5 CALCULATION OF THE ARITHMETIC MEAN AND
THE STANDARD DEVIATION FROM FREQUENCY
DISTRIBUTION TABLES


If the frequency distribution table has class intervals of equal 
lengths, approximate values can be computed for $\mu$ and $\sigma$ with a
considerable saving in labor as compared to their computation from 
the individual measurements. The method of computation involves 
the sole assumption that the numbers grouped into each class actu- 
ally were at the midpoint of their class. Although that assumption 
is not strictly correct, the individual discrepancies usually balance 
out so well that the net error is unimportant in practice. If it should 
be decided that some additional accuracy is needed, Sheppard’s cor- 
rections for grouping can be employed. (See, for example, Kenney, 
Mathematics of Statistics, Part One, D. Van Nostrand.) 

Table 2.51 presents methods for computing $\mu$ and $\sigma$ which follow
directly from the definitions of these quantities if all the data in a
class are considered to be at the midpoint of the class. For example,
the data in Table 2.51 would be considered to be 22.5, 22.5, 17.5, 17.5,
17.5, 17.5, 17.5, 17.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5,
12.5, 7.5, 7.5, 7.5, 7.5, 7.5, 2.5, and 2.5, each midpoint appearing pre-
cisely that number of times indicated by the class frequency, f. The
student should check the fact that the sum of these numbers is 317.5,
which is shown in Table 2.51 as the total of the column headed "f·z."
The symbol z is employed to denote the midpoint of the class interval.

For convenience and for uniformity of procedure, the midpoint of a
class of data measured on a continuous scale is defined to be the
lower limit of the class (as recorded in the table) plus one-half the
length of the class interval. Also, the length of the class interval, for
such data, is defined as the numerical difference between any two
successive left- or right-hand end points of classes. Thus the mid-
point of the class "20-24.999 ..." is 20 + (1/2)(5) = 22.5 = the z for this
class interval.

The reader should note that there will be circumstances in practice
which will justify a different determination of the midpoint, z. For
example, if some objects have been weighed to the nearest pound it
is reasonable to suppose that an interval written as 20-24 actually
means 19.5 to 24.5. If the class intervals are so written, the above-
stated rules apply. The length of the class interval will be 5 as be-
fore, but the midpoint will be computed as 19.5 + (1/2)(5) = 22.0
instead of 22.5, as it would be if computed on the assumption that
the interval started at 20.

If we are summarizing data which can only be integers, a class
interval such as 20-24 should include only the numbers 20, 21, 22,
23, and 24. It then is reasonable to take z = 22. The length of the
class interval should be taken as 5 again so that the numerical dis-
tance between midpoints will coincide with the length of the class
interval, which seems to be a reasonable requirement. The proper
procedure for other methods of measurement can be figured out along
the lines just outlined.


TABLE 2.51

ILLUSTRATION OF A METHOD OF CALCULATING μ AND σ FROM THE DATA IN A
FREQUENCY DISTRIBUTION TABLE WITH CLASS INTERVALS OF EQUAL LENGTHS

Class       Midpoint   Frequency
Interval       z           f         f·z      z - μ    f·(z - μ)²
20-24.9      22.5          2        45.0       9.8       192.08
15-19.9      17.5          6       105.0       4.8       138.24
10-14.9      12.5         10       125.0      -0.2         0.40
 5- 9.9       7.5          5        37.5      -5.2       135.20
 0- 4.9       2.5          2         5.0     -10.2       208.08
Totals                    25       317.5                 674.00

$$\mu = \frac{\Sigma(f \cdot z)}{\Sigma(f)} = \frac{317.5}{25} = 12.7; \qquad \sigma = \sqrt{\frac{674.00}{25}} = 5.19, \text{ approximately.}$$
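
The arithmetic of Table 2.51 can be checked with a few lines of Python.
This is only a sketch: the name `grouped_mean_sd` is an assumption made
for illustration, and the standard deviation is computed with divisor N
(the total frequency), which is what the table's totals imply.

```python
from math import sqrt

# (midpoint z, frequency f) for each class of Table 2.51
classes = [(22.5, 2), (17.5, 6), (12.5, 10), (7.5, 5), (2.5, 2)]

def grouped_mean_sd(classes):
    """Mean and standard deviation, treating each class as f copies of z."""
    n = sum(f for _, f in classes)
    mean = sum(f * z for z, f in classes) / n
    variance = sum(f * (z - mean) ** 2 for z, f in classes) / n  # divisor N
    return mean, sqrt(variance)

mu, sigma = grouped_mean_sd(classes)
print(round(mu, 2), round(sigma, 2))
```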


Another, and easier, method of computing μ and σ from a frequency
distribution table with equal class intervals is illustrated in Table
2.52 along with a partial demonstration of the generality of the
method. The procedure involves the same assumption made above
and produces exactly the same values for μ and σ. However, in this
method the class interval is employed as the computational unit,
with the result that the sizes of the numbers needed in the process
are smaller than those of Table 2.51. This makes the computations
simpler.

The procedures outlined in Table 2.52 can be justified as follows.


It is obvious that the midpoints, z_i, of Table 2.51 can be rewritten
as follows:

     2.5 = 12.5 - 2(5),
     7.5 = 12.5 - 1(5),
    12.5 = 12.5 + 0(5),
    17.5 = 12.5 + 1(5),
    22.5 = 12.5 + 2(5).

Then

$$\begin{aligned}
\Sigma(f \cdot z) &= 2[12.5 + 2(5)] + 6[12.5 + 1(5)] + 10[12.5 + 0(5)] \\
&\quad + 5[12.5 - 1(5)] + 2[12.5 - 2(5)] \\
&= (2 + 6 + 10 + 5 + 2)(12.5) \\
&\quad + [2(2) + 6(1) + 10(0) + 5(-1) + 2(-2)](5);
\end{aligned}$$

or, a bit more generally,

$$\Sigma(f \cdot z) = (\Sigma f)(12.5) + \Sigma(f \cdot d)(I),$$

if d = +2 for the top class of Tables 2.51 and 2.52, d = +1 for the
next class down, etc., until d = -2 for the bottom class of each of
those tables. (The symbol I stands for the length of the class
interval.) Therefore,

$$\mu = \frac{\Sigma(f \cdot z)}{\Sigma f} = 12.5 + \frac{\Sigma(f \cdot d)}{\Sigma f}(I),$$

leading to [...] d should be taken as zero for
the class with the greatest frequency. If the distribution is quite
non-symmetrical, it is advisable to shift the choice [...] so that
the positive and negative d's will be more nearly balanced [...]
the greatest frequency, f.

It can be seen in Tables 2.51 and 2.52 that when d = 0 for the
class with the largest frequency (f) the resulting arithmetic involves 
smaller numbers than for the other methods. It should be noted, 
again, that all three of the methods illustrated give exactly the same 
answers; the only differences lie in the ease of computation. 


TABLE 2.52

ILLUSTRATION OF A SIMPLIFIED METHOD FOR COMPUTING μ AND σ FROM A
FREQUENCY DISTRIBUTION TABLE WITH EQUAL CLASS INTERVALS

                                  Method A (d taken 0      Method B (d taken 0
                                  for interval with        for interval with
Class      Midpoint  Frequency    greatest frequency)      lowest frequency)
Interval      z          f          d     f·d    f·d²        d     f·d    f·d²
20-24.9     22.5         2         +2      4      8          0      0      0
15-19.9     17.5         6         +1      6      6         -1     -6      6
10-14.9     12.5        10          0      0      0         -2    -20     40
 5- 9.9      7.5         5         -1     -5      5         -3    -15     45
 0- 4.9      2.5         2         -2     -4      8         -4     -8     32
Totals                  25                +1     27               -49    123

$$\mu = (z \text{ for class with } d = 0) + \frac{\Sigma(f \cdot d)}{\Sigma(f)}(I) = 12.5 + (1/25)(5) = 12.7.$$

$$\sigma = (I)\sqrt{\frac{\Sigma(f \cdot d^2) - [\Sigma(f \cdot d)]^2/\Sigma(f)}{\Sigma(f)}} = (5)\sqrt{\frac{27 - (1)^2/25}{25}} = 5.19.$$
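
The claim that Methods A and B of Table 2.52 give the same answers can
be checked numerically. The sketch below is illustrative only; the
function name `coded_mean_sd` and its argument names are assumptions,
while the frequencies and d values are those of the table.

```python
from math import sqrt

def coded_mean_sd(z_origin, interval_length, freqs, codes):
    """Mean and standard deviation from class codes d (class-interval units)."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, codes))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, codes))
    mean = z_origin + (sum_fd / n) * interval_length
    sd = interval_length * sqrt((sum_fd2 - sum_fd ** 2 / n) / n)
    return mean, sd

freqs = [2, 6, 10, 5, 2]   # classes listed from 20-24.9 down to 0-4.9

# Method A: d = 0 for the class with midpoint 12.5.
mean_a, sd_a = coded_mean_sd(12.5, 5, freqs, [2, 1, 0, -1, -2])
# Method B: d = 0 for the class with midpoint 22.5.
mean_b, sd_b = coded_mean_sd(22.5, 5, freqs, [0, -1, -2, -3, -4])

print(round(mean_a, 2), round(sd_a, 2))
print(round(mean_b, 2), round(sd_b, 2))
```

Only the intermediate sums differ between the two calls (1 and 27 for
Method A, -49 and 123 for Method B); the choice of origin changes the
size of the numbers handled, not the result.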


The derivation of the formula shown for σ is more difficult than
that for μ, as might be expected, but it can be obtained by elementary
algebra, formula 2.13, and by expressing each z in terms of the one 
for which d = 0. This derivation will be left as an exercise for the 
ambitious student. 

The methods just described can be applied to obtain satisfactory 
approximations to the arithmetic mean and the standard deviation 
of the ACE scores in Table 2.01, a task which clearly would be quite 
laborious if formulas 2.13 and 2.11 were to be employed directly on 
the 1290 numbers in that table. The d is taken as zero for the class
with a frequency of 277 (Table 2.42) instead of the class with f = 278
because they are essentially the same size and the distribution is a
bit non-symmetrical (or skewed) in the direction of the lower ACE
scores. The general result is to have smaller d's with the larger
frequencies, and hence to make the computations somewhat easier.
Following are the required calculations based on Table 2.42; it is
assumed that the scores are necessarily integers.


   z       f      d     f·d    f·d²
           3      6      18     108
           4      5      20     100
          33      4     132     528
          81      3     243     729
         186      2     372     744
         278      1     278     278
  92     277      0       0       0
         209     -1    -209     209
         132     -2    -264     528
          69     -3    -207     621
          15     -4     -60     240
           3     -5     -15      75

      Σ(f) = 1290     Σ(f·d) = +308     Σ(f·d²) = 4160


By the formulas previously used,

$$\mu = 92 + (308/1290)(15) = 95.6,$$

compared to the true mean of 95.7, and

$$\sigma = (15)\sqrt{\frac{4160 - (308)^2/1290}{1290}} = 26.7,$$

compared to the true value of 26.1.

In view of the fact that the scores were integers, these approximations
certainly would be considered satisfactory, and the time and
labor saved by these methods are considerable.

The distribution of the population of ACE scores is rather symmetrical,
that is, there is a region of high frequency about halfway
between the extremes, and the frequency of occurrence of scores away
from this region diminishes at about the same rate as scores are
considered equally far above and below the region of highest
frequency. This distribution is shown in Figure 2.41. With this type
of distribution the arithmetic mean is an excellent average to use as
a part of the description of the population.

Other distributions may be non-symmetrical, or skewed. For such
populations the median often serves as the more descriptive average.
As a matter of fact, the difference between the sizes of the arithmetic
mean and the median is an indication of the degree of skewness, or
lack of symmetry, in the frequency distribution. If the distribution
is perfectly symmetrical (no skewness) the arithmetic mean and the
median are equal. The more skewed the distribution, the farther
apart the median and this mean may become.
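
The relation between skewness and the gap between mean and median can
be seen on two tiny sets of numbers. Both lists below are hypothetical
values invented for this illustration; they are not data from the text.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5, 6, 7]
skewed = [1, 2, 3, 4, 5, 6, 70]   # one very large value drags the mean upward

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, "mean =", mean(data), "median =", median(data))
```

The median is the same for both lists, while the mean of the skewed
list is pulled well above it by the single extreme value.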


PROBLEMS 


1. Compute the arithmetic mean of the following numbers by Method A of 
Table 2.52: 


X: 24, 8, 7, 14, 21, 10, 12, 14, 17, 9, 11, 5, 15, 16, 8, 2, 13, 18, 12, 3, 15, 4, 16, 19, 
and 11. 


Use class intervals 2-5.9..., etc., to 22-25.9....
2. Compute the arithmetic mean and the standard deviation for problem 1
exactly, and compare with the values obtained by the methods of Table 2.52.
Ans. μ = 12.2, σ = 5.5; they are 12.8 and 5.3 by table.
3. Put the numerical measurements of problem 1, section 2.4, into a fre- 
quency distribution table with class intervals of equal lengths, and compute 
the standard deviation of those counts. 
4. Do as in problem 3 for the data of problem 6, section 2.4. Ans. 23.7. 
5. Calculate the mean and standard deviation for the hypothetical data in
the following table. Also, compare six times the standard deviation with the
range as nearly as it can be derived from the table.


Class Interval Frequency 
28-29.9... 5 
26-27.9... 16 
24-25.9... 29 
22-23.9... 41 
20-21.9... 50 
18-19.9... 45 
16-17.9... 32 
14-15.9... 20 
12-13.9... 9 
10-11.9... 3 

Total 250 


6. Graph the relative cumulative frequency distribution for problem 5 and
read from it the percentage of the measurements which exceed 23. Which
exceed the arithmetic mean. Which lie between the mean and 23.
Ans. 28%, 50%, 22%.

7. What is the median for problem 5 as read from the r.c.f. curve? Which is
the modal class? Would you expect the mode and the median to differ by
as much as two units; or less than two units? Give reasons.

8. Graph the following actual or estimated age distributions of the United
States population and draw appropriate conclusions regarding apparent trends
during the decades covered. Consider top class as 0-4 and bottom one as 75-79.
(Data from Current Population Report, Population Estimates, August 10, 1950,
Bureau of the Census.) Numbers are thousands.


Age Class 1940 1950 1960 

Under 5 years 10,542 16,580 13,121 

5-9 10,685 13,959 15,693 
10-14 11,746 11,349 17,439 
15-19 12,334 10,561 13,860 
20-24 11,588 11,585 11,274 
25-29 11,097 12,161 10,725 
30-34 10,242 11,439 11,771 
35-39 9,545 10,960 12,211 
40-44 8,788 10,061 11,377 
45-49 8,255 9,231 10,713 
50-54 7,257 8,254 9,583 
55-59 5,868 7,440 8,469 
60-64 4,760 6,210 7,205 
65-69 3,748 4,611 5,980 
70-74 2,561 3,282 4,428 
75 years and over 2,655 3,716 5,083 


9. Change the top and bottom classes in problem 8 to 0-4 and 75-79, and then
compute μ and md. Which do you consider the better average here?
10. Given the following frequency distribution of the minimum annual
temperatures in 116 cities of Kansas, compute the arithmetic mean and the
standard deviation:

Temperature Interval       f
-17.4 to -15.0           [...]
-19.9 to -17.5           [...]
-22.4 to -20.0           [...]
-24.9 to -22.5           [...]
-27.4 to -25.0           [...]
-29.9 to -27.5           [...]
-32.4 to -30.0            17
-34.9 to -32.5             1
-37.4 to -35.0             2
-39.9 to -37.5             0
-42.4 to -40.0           [...]
                    Σ(f) = 116

Ans. -25.7°, 4.4°.
11. Given the following table of average lengths of growing season for
ninety-five Kansas counties, compute the mean length of growing season, and
also the median length. Which would be the most descriptive average here?

Class Interval             f
198-202 days             [...]
193-197                  [...]
188-192                  [...]
183-187                  [...]
178-182                  [...]
173-177                  [...]
168-172                  [...]
163-167                  [...]
158-162                  [...]
153-157                  [...]


2.6 PERCENTILES, DECILES, AND QUARTILES


The standard deviation about the arithmetic mean, the range, the
average deviation, and the comparative magnitudes of the median
and the mean (all presented earlier) provide useful information
regarding the dispersion of the numerical measurements in a group of
data which is being analyzed. However, there are some circumstances
in which it is desirable to divide the ordered array into segments
each containing a stated percentage of all the numbers in the
set. More specifically, it may be convenient to partition a large
body of data into four, ten, or one hundred subgroups, each containing
approximately the same number of measurements from the set,
and with the subgroups corresponding to successive segments of the
array. The subgroups will be called quartiles if four divisions are
employed, deciles if there are ten subgroups, and percentiles if there
are one hundred subgroups.* The aim in stating the upper limit of
the first quartile, for example, is to designate a number such that
one-fourth the numbers in the array are less than or equal to that
upper limit.

Although the upper and the lower limits of the quartiles, deciles,
and percentiles could be read from a carefully drawn r.c.f. curve if
the data are sufficiently numerous, it is desirable to have precise
definitions for them. This could be done in a variety of ways, not
essentially different, so that certain convenient and reasonably
standard definitions will be adopted rather arbitrarily.

Before general rules and methods for determining the limits on
the quartiles, deciles, and percentiles are considered attention is called
to the following two arrays and to some general problems inherent
in the determination of such subgroups as quartiles:

Set 1. 1, 2, 3, 4, 6, 8, 8, 9, 10, 10, 11, 12, 15, 18, 18. N = 15.
Set 2. 1, 2, 4, 7, 9, 9, 11, 11, 12, 15, 15, 18, 20, 24, 25, 27. N = 16.

Suppose that we wish to divide these sets of numbers into four
subgroups, each containing equally many numbers, if possible, and
coming as close as possible to equality in other instances. Two facts are
immediately clear. (a) When N is not a multiple of four, we cannot
define four groups each containing one-fourth of N measurements;
and (b) repetitions of numbers will pose a problem in some instances
because numbers of equal size logically must be in the same subgroup,
and yet to put them there sometimes will cause one subgroup to
contain more than its stated proportion of all the measurements.

* It seems to the author that the term percentile should refer to an interval
which includes approximately one per cent of all the measurements. However,
most textbooks use this term to designate only one end point of what is called
a percentile herein. Similar remarks apply to the terms decile and quartile.
Since we usually speak of a score being in a percentile rather than at it, usage
seems to support the point of view taken herein.

It will be convenient first to describe the method to be used to
determine percentile limits because deciles and quartiles can be
defined in terms of percentiles. The general aim in defining percentiles
is to divide the ordered array into 100 subgroups, each of which
contains one per cent of the numbers in the set, as nearly as this is
possible. This result will be accomplished by defining the upper limit
of the pth percentile to be the [(p/100)(N + 1)]th number in that
array. For example, if N = 1290, as in Table 2.01, the upper limit
of the ninetieth percentile is the [(90/100)(1291)]th, or the 1161.9th,
number in the array or along its scale of measurement. Such an
"ordinal" number as 1161.9 will be defined to be the number which
is nine-tenths of the way between the 1161st and the 1162nd numbers
from the bottom of the array. It is seen from Table 2.42 that there
are 1169 numbers less than or equal to 129. With this information
it is found that the 1161st and the 1162nd scores in order of size are
128 and 129, respectively. Hence, the 1161.9th number along the
scale of the ACE scores is 128.9, which, then, is the upper limit of
the ninetieth percentile. The lower limit of this percentile is just the
upper limit of the eighty-ninth percentile. By definition, this is the
[89(1291)/100]th number along the array of the ACE scores. Since
89(1291)/100 = 1148.99, the lower limit of the ninetieth percentile
is a number which is .99 of the way between the 1148th and 1149th
scores from the bottom of the array. The 1148th score is 127, whereas
the 1149th score is 128; hence the lower limit of the ninetieth
percentile is 127.99. It follows then that the ninetieth percentile
contains scores of 128 only. Actual enumeration discloses that there
are 13 scores of 128, which is as close to one per cent of 1290 as is
possible with integers. Such close agreement with the ideal will not
be attained with most of the percentiles, especially in the
neighborhood of the mean and the median, because there will be
repetitions of scores which will cover [...]. When much of this occurs
[...] given by deciles or even quartiles.

The upper limit of the first decile is the same as the upper limit of
the tenth percentile, and similarly for the other deciles. The upper 
limits of the first, second, and third quartiles are the same as the 
upper limits of the twenty-fifth, fiftieth, and seventy-fifth percentiles. 
It should be clear that the median is the upper limit of the second 
quartile. 
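
The definition of the upper limit of the pth percentile as the
[(p/100)(N + 1)]th number can be sketched in Python and tried on Set 1
and Set 2 given earlier in this section. The function names are
assumptions made for illustration. The text does not say what to do when
the ordinal position falls below 1 or above N, so clamping it to that
range is a further assumption of this sketch.

```python
def ordinal_value(sorted_data, position):
    """Value at a possibly fractional 1-based position, by linear interpolation."""
    n = len(sorted_data)
    # Assumption (not stated in the text): positions outside 1..N are clamped.
    position = min(max(position, 1), n)
    whole = int(position)            # e.g. 4 for the 4.25th number
    fraction = position - whole      # e.g. 0.25
    lower = sorted_data[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_data[whole]
    return lower + fraction * (upper - lower)

def percentile_upper_limit(data, p):
    """Upper limit of the pth percentile: the [(p/100)(N + 1)]th number."""
    ordered = sorted(data)
    return ordinal_value(ordered, p * (len(ordered) + 1) / 100)

set1 = [1, 2, 3, 4, 6, 8, 8, 9, 10, 10, 11, 12, 15, 18, 18]
set2 = [1, 2, 4, 7, 9, 9, 11, 11, 12, 15, 15, 18, 20, 24, 25, 27]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    q1, md, q3 = (percentile_upper_limit(data, p) for p in (25, 50, 75))
    print(name, q1, md, q3)
```

For Set 1 (N = 15) the three positions are the 4th, 8th, and 12th
numbers exactly; for Set 2 (N = 16) they are the 4.25th, 8.5th, and
12.75th, so each limit falls between two neighboring numbers.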

It is traditional to designate the upper limits of the first and third
quartiles as Q1 and Q3, respectively, even though the term quartile
may be used differently from the way it is used in this book, as
was mentioned earlier in a footnote.


PROBLEMS 


1. Following are the average temperatures for July in Topeka, Kansas, from
1901 to 1930, inclusive, in degrees Fahrenheit: 86.6, 77.0, 77.6, 75.0, 74.1, 74.8,
78.7, 76.0, 78.0, 79.4, 78.8, 79.9, 81.8, 80.3, 74.0, 81.9, 80.4, 78.0, 81.6, 76.8, 79.5,
76.4, 79.0, 75.2, 78.6, 79.0, 76.6, 78.3, 79.0, and 82.4. Obtain Q1, md, and Q3.

2. Determine and interpret the limits of the second decile for the data of
problem 1. Also compute the median. Ans. 74.88 to 76.08; md = 78.65.

3. What are the limits of the third quartile of the data of problem 6 of
section 2.1?

4. What are the limits of the first quartile for the fly counts given in
problem 6, section 2.4? Ans. 0 to 8 inclusive.

5. Calculate the limits on the ninth decile for the counts of problem 1, section
2.4. What information can you derive from these limits?

6. Use Figure 2.61 on page 40 to determine the approximate sizes of Q1, md,
and Q3 for the birth weights recorded in Table 2.61. What information about
the birth weights do these numbers give? Ans. 66, 81, 95 grams.

7. Construct a frequency distribution table and a graph for the 4-day gains
of Table 2.62, and compute the mean gain.

8. Calculate the limits on the 10th percentile of the 4-day gains of Table
2.62 and interpret these numbers statistically. Ans. -30 to -1.7, inclusive.

9. Construct a relative cumulative frequency distribution table for the birth
weights listed in Table 2.61, using the class limits indicated in Figure 2.61.

10. Suppose that a student entering college takes the following tests: a
general psychological test, a reading test, a mathematics aptitude test, a social
science aptitude test, and a physical science aptitude test. If his respective
percentile ratings are 90, 87, 50, 92, and 63, what advice would you give him
regarding a choice of a curriculum, assuming that you have faith in these tests?
Explain your reasoning.

11. Determine the lower limit of the upper (tenth) decile for the ACE scores
of Table 2.01.


TABLE 2.61


BIRTH WEIGHTS OF FEMALE (F) AND MALE (M) GUINEA PIGS BORN
DURING A PARTICULAR EXPERIMENT


(Data obtained from H. L. Ibsen, Kansas State College.) 


Jan, F 65.3 106.0 100.7 52.0 81.6 83.9 89.6 Nov, M 
77.6 73.3 106.2 74.4 60.3 93.6 90.0 87.2 118.9 
100.2 80.0 84.7 63.0 53.9 43.9 92.2 83.7 108.4 
66.7 57.2 69.4 65.0 73.1 84.6 79.2 98.2 75.9 
62.7 64.3 79.5 70.0 81.5 87.0 105.5 98.6 
72.7 771 . May? 85.5 77.3 63.7 115.3 82.2 122.3 
82.9 63.4 93.5 72.4 67.4 64.3 68.0 123.6 99.6 
59.4 57.7 84.5 90.7 58.2 54.6 52.5 86.8 89.4 
99.0 85.2 75.6 57.5 80.6 79.3 
Jan, M March, M 71.5 97.3 62.2 59.3 Sept, F 90.0 106.7 
87.4 94.0 77.8 57.0 62.2 81.8 69.6 95.5 
97.3 80.5 768 Tone M 41.1 74.2 57.4 109.8 105.1 
97.6 84.3 91.1 112.0 66.6 61.7 70.2 119.1 
107.3 73.7 57.3 44.5 41.5 64.7 53.0 Oct, F 130.4 
86.7 79.1 96.7 36.1 50.7 63.2 80.7 60.9 102.7 
58.9 92.6 112.4 60.3 48.6 61.6 77.4 92.9 100.6 
- 75.3 82.1 79.8 63.0 48.0 81.2 91.2 93.5 
46.1 90.0 100.3 69.1 455 Am M 80.9 96.4 134.0 
56.2 100.8 66.1 51.6 102.7 76.8 50.4 136.3 
Feb., F 63.2 91.0 53.8 65.8 117.2 64.0 76.2 113.5 
77.3 61.8 84.6 82.1 45.8 63.7 96.5 115.6 
75.0 109.6 88.0 63.0 65.9 55.0 87.8 Oct., M 74.9 
72.3 76.5 69.0 62.5 63.4 75.2 89.2 124.9 80.0 
77.6 68.9 75.6 79.4 67.3 85.2 109.0 107.0 98.2 
1055 - 67.9 83.9 94.6 46.6 76.4 91.2 91.7 89.7 
98.8 57.8 72.1 74.0 49.0 47.4 121.1 119.2 82.2 
88.7 78.8 97.4 82.0 49.9 66.6 91.5 109.6 106.7 
76.5 73.9 104.3 87.6 55.3 82.0 87.6 107.8 E 
56.7 75.5 67.0 80.5 79.6 78.8 008 Dio. p 
90.1 68.2 64.1 63.6 Aug. F 66.5 91.0 101.9 110.6 
138.4 65.0 94.6 78.0 53.3 62.9 112.2 68.1 97.9 
51.4 94.2 70.3 124.5 88.8 73.6 94.7 72.0 76.8 
133.4 59.8 83.2 74.3 56.9 87.8 94.2 
44.4 56.2 -July,F 45.0 88.3 76.0 Nov. P 89.3 
Feb., M 50.6 37.0 85.9 92.9 67.7 94.0 87.9 
065 6 Ager T 54.7 85.8 58.4 91.2 90.8 81. 
105.0 80.8 May, M 51.6 89.0 45.6 93.1 117.7 a 
80.8 95.5 1100 65.7 71.2 98.0 84.8 103.5 44 
76.5 82.2 112.7 53.7 121.9 92.0 99. 2 e 
104.9 69.2 109.4 76.1 1904 94.7 n e dre 
1042 972 — 1134 "543 78.3 DET dept M de 87.3 
: 9.3 115.6 47.3 g ' i 
104.0 68.9 1027 Li "4 00.3 727 785 on 
105.0 75.55 — 1106 42.1 91.3 72.5 53.0 os | 
83.1 752 80.3 40.8 99.9 98.4 84.0 TE Hea 5 
i Ei 102.6 65.8 ; ; "b 
94.5 84.5 77.8 38.7 n ug figs CE e 
102.6 102.5 47.3 55.0 33.5 70. 
80.0 71.6 82.6 54.6 56. i: ord HS 
67.2 63.9 59.7 68.9 232 Pd ia "tes es 
March, F Apr., M dae Pn E 52 are ee b 
880 110.3 77.5 43.6 67.0 773 90.8 oup 169 
90.0 96.8 61.5 ; 78.3 759 90.8 99.5 94.5 
99.3 BUB C MIS e Judy, M 70.6 75. c m1 dis 
502 117.8 1972 97.8 82.0 70.2 ig 1020 one 
oe 64.4 54.6 80.1 77.7 90.8 bce d 
105.2 M ESO. 38.6 88.9 86.7 E 195 om 
o s RAE 225 70.8 77.8 81.2 107.5 


49.5 88.2 34.2 102.1


TABLE 2.62


FOUR-DAY GAINS OF FEMALE (F) AND OF MALE (M) GUINEA PIGS
DESCRIBED IN TABLE 2.61

(Negative sign denotes loss of weight.)


15.9
31.5
26.0
19.2
22.5

5.5
21.3

8.0
20.0
21.4
19.2
11.4


19.1 Oct., M 
15.2 22.0 
25.6 20.4 
12.0 17.6 
1.5 5.5 
12.1 12.5 
12.9 213 
10.9 15.8 
9.7 — 30.0 
0.3 —22.1 
— 5.3 — 23.3 
Sept. M Nov., F 
— 26.3 — 0.8 
4.5 — 0.4 
3.7 — 21.7 
14.8 25.6 
26.4 8.4 
10.8 — 10.3 
9.5 2.6 
13.7 4.2 
14.7 4.5 
7.2 24.6 
5.7 23.3 
18.7 8.6 
18.3 7.3 
— 2.3 7 
22.8 15.0 
8.9 —16.5 
28.5 26.1 
23.4 8.6 
13.0 11.0 
14.8 13.7 
— 0.6 — 0.8 
10.1 5.7 
14.1 13.2 
16.4 16.8 
29.4 
11.1 Nov., M 
5.2 29.5 
6.9 26.0 
10.7 — 3.3 
20.0 0.7 
20.3 
Oct., F 9.1 
17.6 5.8 
25.8 0.6 
19.7 6.2 


Figure 2.61. Graph of the relative cumulative frequency distribution for the
data of Table 2.61.


2.7 THE COEFFICIENT OF VARIATION 


There is considerable need for a measure of relative variability in
one set of numbers as compared with another when the units of
measure, or the levels of magnitude of measurement, are quite different.
The standard deviation, the mean deviation, and the range are
expressed in the same units as those in which the data were taken; so
they obviously reflect the general size of those units.

Suppose that one bushel of a particular sort of crop weighs 60
pounds, on the average. Then the frequently used unit "pounds per
1/1000 acre" is but 0.06, or 6 per cent, of the size of the unit "bushels
per acre." Hence to convert yields in pounds per 1/1000 acre into
bushels per acre would require multiplication by 16 2/3. To see what
effect such a procedure has on the standard deviation, consider the
general case of two measurements X and kX, where the k is any
constant. For example, X could be the number of pounds per 1/1000
acre so that k = 16 2/3. By formula 2.13,

$$\sigma_X = \sqrt{\frac{\Sigma X^2 - (\Sigma X)^2/N}{N}}$$

is the standard deviation of X. For the measurement, kX,

$$\sigma_{kX} = \sqrt{\frac{\Sigma(kX)^2 - (\Sigma kX)^2/N}{N}} = \sqrt{\frac{k^2[\Sigma X^2 - (\Sigma X)^2/N]}{N}} = k\sigma_X.$$

In other words, the standard deviation of the yields in bushels per
acre is 16 2/3 times that of the same yields expressed as pounds per
1/1000 acre. Hence, even though σ is an excellent and widely used
measure of the variability exhibited by a group of numerical
measurements, its size does depend directly upon the units of measure
involved, and also upon the level of magnitude of those measurements.
To illustrate the point regarding the level of magnitude of
measurement, suppose that one were interested in knowing if the weights of
thirty-year-old males in Manhattan, Kansas, were more (or less)
variable than the weights of twelve-year-old boys in that city.
Suppose also that the average weight of the men is known to be
approximately twice that of the youths. The analysis just presented shows
that if the boys' weights were each to be doubled so they would be
on a level comparable to that of the men, their standard deviation
automatically would be doubled too. It does not seem reasonable
that doubling all of the X's in a set of measurements should change
their fundamental variability relative to another set of
measurements; hence there is need for a measure of variability which would
not be so affected. The coefficient of variation is that sort of measure
of relative variability.

It is easy to see that the mean of kX is k times the mean of X
because μ_kX = Σ(kX)/N = kΣ(X)/N = kμ_X. Therefore, the ratio of
the standard deviation to the arithmetic mean will be a measure of
relative variability in a useful sense because

$$\frac{\sigma_{kX}}{\mu_{kX}} = \frac{k\sigma_X}{k\mu_X} = \frac{\sigma_X}{\mu_X}$$

regardless of the size of k (≠ 0). It is customary to express this ratio
of the standard deviation to the arithmetic mean as a percentage, and
to define the coefficient of variation (CV) by

$$CV = 100\sigma/\mu. \tag{2.71}$$

To illustrate formula 2.71 from previously discussed data, the
student can verify that, for ACE scores, CV = 27.3; for the birth weights
of guinea pigs, CV = 25.2; and for problem 5, section 2.5, CV = 18.9,
each as a per cent. A person acquainted with ACE scores might then
observe that the scores at Kansas State College during 1947 were
relatively more variable than the national scores, which (it is
supposed for illustration) had CV = 20 per cent. Concerning the birth
weights, we might learn that some other group of these animals has
a standard deviation of only 15 grams; and hastily (and erroneously)
conclude that they were more uniform in weight than those whose
weights are reported in Table 2.61. However, if the second group
has a mean weight of μ = 50 grams, it then is apparent that CV =
100(15/50) = 30 per cent. Hence Professor Ibsen's guinea pigs had
less variability in weight at birth than the other group of guinea pigs
when account is taken of the fact that they were generally heavier.
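
The two properties used in this section, that multiplying every
measurement by k multiplies the standard deviation by k while leaving
the coefficient of variation unchanged, can be checked numerically. In
the sketch below the yields in `pounds` are hypothetical values chosen
only for illustration, and `coefficient_of_variation` is an assumed
helper name; the standard deviation uses divisor N.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """CV = 100 * sigma / mu, with sigma computed using divisor N."""
    return 100 * pstdev(data) / mean(data)

# Hypothetical yields in pounds per 1/1000 acre (illustrative values only).
pounds = [2.4, 3.0, 3.6, 2.7, 3.3]
k = 1000 / 60                      # converts to bushels per acre (16 2/3)
bushels = [k * x for x in pounds]

print(pstdev(bushels) / pstdev(pounds))          # ratio of the two sigmas
print(coefficient_of_variation(pounds))
print(coefficient_of_variation(bushels))

# The guinea-pig comparison in the text: sigma = 15 grams, mu = 50 grams.
print(100 * 15 / 50)
```

The first printed ratio agrees with k up to rounding error, and the two
coefficients of variation agree with each other, whatever the units.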


PROBLEMS 


1. Compute the coefficient of variation for each of the following and draw 
appropriate conclusions: 


X (N. Y. Curb Issues): 4, 3, 88, 1, 108, 42, 1, 25, 18, 5, 3, 6, 2, 22, and 70; 
Y (Bond Quotations): 88, 115, 104, 113, 119, 80, 66, 40, 31, 101, 48, 43, 100, 84, 
and 15. 
2. Using the X_i as -2, 5, 8, 8, 1, 0, -2, 4, 3, and 6, and using k = 2,
demonstrate that σ_kX = k·σ_X.
3. Suppose that a group of measurements of the yield of corn in a certain
area of Iowa had μ = 70 bushels per acre, with σ = 10 bushels, whereas an area
in Kansas, growing the same variety of corn and employing the same agronomic
method of culture, gave yields with μ = 40 bushels per acre. Are the yields in
that part of Iowa relatively [...]

[...] during a certain period of years [...] the prices of a certain
commodity averaged $1.25, with standard deviation of 25 cents. The prices of
this [...]


2.8 SOME OF THE PROBLEMS CREATED WHEN ONLY
A SAMPLE OF A POPULATION IS AVAILABLE
FOR STATISTICAL STUDY


Suppose that we wished to study ACE scores of college students
but could not afford the time or the expense required to analyze all
their scores, and hence took only a portion of them, say 50. Although
an economy of time and money will be obtained, several new
problems will be created.

First, how should the 50 students be chosen for the sample?
Ideally, they should be representative, in all important respects, of
the whole group which is being sampled. But this cannot be
ascertained without studying the ACE scores of the whole group, and
then no sampling would be needed. If the first 50 on an alphabetical
list were to be taken, the MacIntoshes, McTaverishes, Swensons, and
Swansons never would be chosen; and they might differ fundamentally
from those who would be chosen. If the first 50 who came into the
counseling bureau were taken as a sample, they might differ as
regards ACE scores from those who came in later, or who never came
into the bureau at all. In view of these and similar dangers of
acquiring a biased sample from such procedures, it is necessary to
devise a sampling method such that every eligible student has an
equal and independent opportunity to be chosen in the sample. The
net result of these requirements is to make it true that every
possible sample of the chosen size (50 in the example above) will be
equally likely to be drawn. This is the fundamental requirement of
random sampling.

There are various ways to draw a random sample of 50 from among
1290 members of a population. One would be to assign each person
who took the ACE test a different number, place these numbers on
pieces of cardboard, and draw 50 of them at random (in the popular
sense) from a bowl containing all of the pieces of cardboard. If the
scores in the population are recorded in rows and columns as in
Table 2.01, we can assign numbers to the rows and to the columns,
and then draw a row number and a column number at random as
before. These two numbers together will uniquely designate a score
for the sample. If this is done 50 times (ignoring any repeats of
exactly the same row-column combination) this sample also will be
a random sample because every possible set of 50 scores from the
1290 in the population will have had an equal opportunity to have
been drawn.

The following random sample of 50 scores from Table 2.01 was
obtained by the second method described above:


131, 66, 117, 117, 145, 71, 118, 99, 128, 111, 95, 78, 88, 55, 86, 89, 97, 
98, 87, 80, 100, 76, 124, 89, 79, 101, 89, 156, 111, 98, 103, 68, 110, 76, 
99, 100, 102, 61, 50, 125, 92, 106, 63, 117, 124, 87, 95, 100, 58, and 99. 


These particular measurements were obtained by chance from among 
many possible different sets of 50. This fact suggests that the theory 
of probability is needed in the analysis of sampling data. 
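
The row-and-column method of drawing such a sample can be sketched in
Python. The layout of Table 2.01 is not reproduced here, so the 43 rows
by 30 columns used below (43 × 30 = 1290) is purely a hypothetical
arrangement, and `draw_positions` is an assumed name. If the real table
were not a full rectangle, any pair pointing at an empty cell would also
have to be skipped.

```python
import random

N_ROWS, N_COLS = 43, 30      # hypothetical layout: 43 * 30 = 1290 positions
SAMPLE_SIZE = 50

def draw_positions(n_rows, n_cols, size, rng):
    """Draw distinct (row, column) pairs, ignoring repeated combinations."""
    chosen = set()
    while len(chosen) < size:
        pair = (rng.randint(1, n_rows), rng.randint(1, n_cols))
        chosen.add(pair)     # a repeated pair is already in the set, so it is ignored
    return sorted(chosen)

rng = random.Random(1947)    # fixed seed only so that the run is repeatable
positions = draw_positions(N_ROWS, N_COLS, SAMPLE_SIZE, rng)
print(positions[:5])
```

Each of the 50 pairs would then be looked up in the table to obtain the
corresponding score.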

It is found in the usual manner that the mean and the standard 
deviation for the sample above are 96.28 and 22.55, respectively. 
The range of scores in this sample is 106, the median is 98, and the 
coefficient of variation is 23.4 per cent. It is known that these sta- 
tistical measures are not likely to be exactly the same as the popula- 
tion parameters, but it is to be hoped that they are not far from 
those values. 
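
The summary figures quoted for this sample can be recomputed directly
from the 50 scores listed above. The sketch assumes that the standard
deviation is taken with divisor N, as in formula 2.13; the variable
names are illustrative.

```python
from statistics import mean, median, pstdev

sample = [
    131, 66, 117, 117, 145, 71, 118, 99, 128, 111, 95, 78, 88, 55, 86, 89, 97,
    98, 87, 80, 100, 76, 124, 89, 79, 101, 89, 156, 111, 98, 103, 68, 110, 76,
    99, 100, 102, 61, 50, 125, 92, 106, 63, 117, 124, 87, 95, 100, 58, 99,
]

sample_mean = mean(sample)
sample_sd = pstdev(sample)                     # divisor N, as in formula 2.13
sample_range = max(sample) - min(sample)
sample_median = median(sample)
sample_cv = 100 * sample_sd / sample_mean

print(round(sample_mean, 2), round(sample_sd, 2), sample_range,
      sample_median, round(sample_cv, 1))
```

Run on these scores, the sketch reproduces the mean of 96.28, standard
deviation of 22.55, range of 106, median of 98, and coefficient of
variation of 23.4 per cent given in the text.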

Another sample was drawn in the same manner as the sample just
described. The following were calculated for this second sample:
mean = 99.22, standard deviation = 27.30, range = 144, median =
100.5, and the coefficient of variation is 27.5 per cent. It is noted
that each of these statistical measures is different for the two samples,
yet only the ranges differ by a large percentage. It is typical of
random samples that they usually differ from each other in several
respects because the particular members of such samples are in the
sample by chance. It also is true that the sizes of such statistical
measures as the sampling mean will follow some predictable pattern
over considerable sampling experience. If this were not true, nothing
much could be learned from sampling. It will be seen in later
chapters that probability theory is needed to study these matters.

To illustrate the effect of the type of population on the results
obtained from random sampling, consider two samples drawn from
the data in problem 1 at the end of section 2.4. For convenience,
samples of 10 numbers each were drawn even though this is a
somewhat larger fraction of the population than was taken from Table
2.01. These samples were obtained by considering that the fly counts
were numbered serially from left to right, starting with the top row.
There are 148 fly counts in this population; hence a sample of 10 was
drawn by effectively drawing 10 numbers at random from among
the numbers 1, ..., 148. When these 10 ordinal numbers were
drawn, the corresponding fly counts were obtained by counting in
the manner indicated above. Following are the summaries of the
two samples:


Sample 1 Sample 2 
Mean = 23.2, median = 19 Mean = 13.4, median = 10 
Standard deviation = 16.2, range = 45 Standard deviation = 9.7, range = 28 
CV = 69.8 per cent CV = 72.4 per cent 


For these two samples, each of the five statistical measures is
different again. Moreover, considering the fact that these fly counts
are generally smaller numbers than the ACE scores, the relative
differences are much larger between the two samples than was true
for the ACE scores. For example, the mean for one sample of the
counts is almost twice the size of the mean of the other sample. Much
the same is true of each of the other measures except the coefficient
of variation. This is an illustration of the fact that a statistical
study of samples requires some information about the frequency
distributions of the populations sampled. Hence this matter, and
probability, must be studied before more can be done about the
analysis of sampling data. These are the aims of chapters 3 and 4.
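
The two coefficients of variation quoted above, and the remark that one mean is almost twice the other, can be checked with a little arithmetic. The following is only a sketch in Python: the function name coefficient_of_variation and the variable names are our own, and the last few lines merely imitate the drawing of 10 ordinal numbers from 1 to 148, assuming that no number is drawn twice (the fly counts themselves are not reproduced here, and the seed is arbitrary).

```python
import random


def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * std_dev / mean


# Summary figures quoted in the text for the two samples of fly counts.
cv_1 = coefficient_of_variation(std_dev=16.2, mean=23.2)
cv_2 = coefficient_of_variation(std_dev=9.7, mean=13.4)
mean_ratio = 23.2 / 13.4

print(f"CV of sample 1: {cv_1:.1f} per cent")
print(f"CV of sample 2: {cv_2:.1f} per cent")
print(f"Ratio of the two means: {mean_ratio:.2f}")

# Hypothetical re-enactment of the drawing: 10 distinct ordinal numbers
# chosen at random from 1, 2, ..., 148 (seed chosen arbitrarily).
rng = random.Random(0)
ordinal_numbers = sorted(rng.sample(range(1, 149), 10))
print("Ordinal numbers drawn:", ordinal_numbers)
```

Each ordinal number would then be looked up in the serially numbered population to obtain the corresponding fly count.
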


REVIEW PROBLEMS 


1. The effectiveness of penicillin in controlling bacterial growth can be meas- 
ured by the “inhibition zone” produced when a standard amount of penicillin is 
properly added to a plate of agar containing the type of bacterial growth one 
wishes to study. Following are 54 such determinations arranged in 9 groups 
of 6 tests each. (From an article by Jackson W. Foster and H. Boyd Woodruff, 
Journal of Bacteriology, August, 1943.) Calculate the arithmetic mean of each 
set of tests, and then compute the standard deviation of these nine means. 


Test 1  Test 2  Test 3  Test 4  Test 5  Test 6  Test 7  Test 8  Test 9
 28.1    28.5    28.0    27.5    29.0    28.0    29.0    28.5    28.0
 28.0    28.0    28.2    28.0    28.0    28.0    29.0    28.2    29.0
 27.5    28.0    27.5    28.1    28.3    28.0    27.3    28.2    28.6
 27.8    28.0    28.0    28.5    27.3    28.0    27.0    27.5    27.7
 28.0    27.5    28.0    28.0    29.0    28.1    29.0    28.0    27.5
 27.5    28.0    28.0    28.2  [illegible] 27.5  29.0    28.0    29.0

2. Determine the range for each set of data in problem 1, and compute
the standard deviation for the set with the greatest range.
Ans. 0.6, 1.0, 0.7, 1.0, 1.7, 0.6, 2.0, 1.0, 1.5; 0.88.

3. If a list of the farm acreages in a certain county in Kansas forms
a statistical array of numbers from 35 to 4000; and if μ = 600 and
md = 350 acres:

(a) Which average would you think might be more typical of the size of
farm in that county?
(b) Would you expect the high point of the frequency distribution to be
about over the mean, to its right, or to its left? Give reasons.

4. Determine the median and also the upper limits of the first and third
quartiles for the data of problem 1 when all results are considered as
one group of data. Ans. 28.0, 27.95, 28.22.

5. Take any newspaper which gives quotations from New York bond prices 
and make frequency and r.c.f. distributions for all the closing prices as listed 
for that particular day. 

6. Graph the distributions for problem 5. Then determine graphically the 
proportion of the prices which exceed 100. Check this result by actual count. 

You are given the following information as a basis for working problems 7 
to 11, inclusive. These data are from the Ohio Psychological Tests given to 
602 students at Kansas State College during 1945. The scores are represented 
by the symbol X in the following summary, and are given only as integers: 


Score Class      f      Score Class      f
At least 111     3      55-61           70
104-110          9      48-54           74
97-103          16      41-47           72
90-96           21      34-40           65
83-89           32      27-33           49
76-82           49      20-26           12
69-75           59      Less than 20     5
62-68           66
                        Total          602

For ungrouped data: ΣX = 36,000; ΣX² = 2,400,000.


7. Make the r.c.f. distribution and graph it.

8. Compute approximations to μ and σ² after changing the top and bottom
classes to 111-117 and 13-19, respectively. Ans. 58.6, 423.0.

9. What percentage of the students had scores above 100? Between 50 and
75? What range of scores is included between the 80th and 90th
percentiles?

10. Graph the frequency distribution and state which average you would
employ to convey the best impression of the level of performance by
these students on this standard test.


... measurements ... = 600 for any particular set of ...
Ans. 48.9 per cent.

13. Given the following frequency distributions for a certain group of
prices ..., construct graphs of these distributions. Briefly describe
the sorts of information which can be obtained from such
illustrations, ...

Price Class      f    r.c.f.    Price Class      f    r.c.f.
At least 121     5    1.00      85-90.99        70     .25
115-120.99      10     .99      79-84.99        30     .11
109-114.99      40     .97      73-78.99        15     .05
103-108.99      90     .89      67-72.99         7     .02
97-102.99      130     .71      Less than 67     3     .01
91-96.99       100     .45
                                Total          500

14. Calculate the median and the lower limits of the second quartile and
of the 85th percentile of the data in problem 13.
Ans. 98.14, 90.99, 107.32.

15. What percentage of the bonds of problem 13 had prices below 100?

16. Within what limits do the prices of the “middle” 60 per cent of the
bonds lie (that is, excluding the lower and also the upper 20 per cent
of the prices in the array)? Ans. 88.85 to 105.99.

17. The figures recorded below are the batting averages of all American
Association players who were at bat at least 100 times, as reported by
the Kansas City Star on July 29, 1951. The averages are arrayed from
highest to lowest. Make a frequency distribution table, and compute μ
and σ.

Player       Average    Player        Average    Player       Average
Walker        .389      Federoft       .295      Lerchen       .266
Cerv          .354      Thorpe         .292      Daugherty     .265
Crowe         .335      Segrist        .291      Markland      .262
Thompson      .335      Mangan         .290      Marchio       .262
Sullivan      .332      Beard          .288      Montag        .262
Wright        .323      Milne          .287      Scherbarth    .262
Richter       .322      Bollweg        .286      Atkins        .260
Katt          .322      Lyons          .284      Henley        .260
Clarkson      .322      Unser          .283      Lund          .259
Dandridge     .317      Barnacle       .282      Basso         .256
Whitman       .315      Pendleton      .280      Antonello     .256
Benson        .314      Brancato       .280      Hoak          .254
Reed          .312      Deal           .279      Marshall      .254
Mordarski     .311      Chapman        .279      Fernandez     .252
Hoderlein     .310      Marquis        .278      Kropf         .250
Mavis         .310      Tipton         .278      Turner        .250
Carey         .309      Zauchin        .277      Ruver         .248
Campbell      .309      McQuillen      .277      Conway        .247
Courtney      .309      Ozark          .276      Lu Cadello    .243
O’Brien       .307      Dallasandro    .276      Aliperto      .237
Broome (L)    .307      Stevens        .274      Natisin       .235
Saffell       .305      Thomas         .274      McAlister     .234
Kalin         .304      Cole           .274      House         .231
Mozzali       .304      Wright         .274      Teed          .230
Cassini       .301      Olmo           .270      Thomson       .223
Merson        .300      Klaus          .269      Rocco         .212
Repulski      .300      Gilbert        .268      Morgan        .179
Broome (C)    .295      De La Garza    .267      Okrie         .165

18. Referring to problem 17, write numbers from 1 to 84, inclusive, on
pieces of pasteboard, mix them well, and then draw 5 at random. Consider
each number drawn as the rank of a person's batting average in the above
list, starting at the top. Do this 10 times, and record the range of
batting averages in each set of 5 so drawn. Compare the average range
with the standard deviation obtained in problem 17. If another such set
of batting averages had twice as large a standard deviation, what
general effect do you think this would have on the range?




REFERENCES 


Dixon, Wilfrid J., and Frank J. Massey, Jr., Introduction to Statistical
    Analysis, McGraw-Hill Book Company, New York, 1951.
Freund, John E., Modern Elementary Statistics, Prentice-Hall, New York,
    1952.
Kenney, John F., Mathematics of Statistics, Part I, Second Edition,
    D. Van Nostrand Company, New York, 1947.
Neiswanger, W. A., Elementary Statistical Methods, Macmillan Company,
    New York, 1943.
Waugh, Albert E., Elements of Statistical Method, Second Edition,
    McGraw-Hill Book Company, New York, 1943.


CHAPTER 3 


Elementary Probability 


Several relatively small populations of data have been studied because
it is not feasible to use large groups of data in the classroom.
Quite commonly, populations actually involve a very large number
of numerical measurements; so large, in fact, that their number can
be considered as infinite without doing appreciable violence to the
subsequent analyses. Obviously, no more than a portion (sample)
of the measurements in an infinite population can be obtained for
study. Sampling theory requires certain probability considerations
and some definite assumptions regarding the distribution of the
measurements in the population (as noted in section 2.8). Hence it is
appropriate to consider some of the more basic and widely used
frequency distributions before attacking the problems of sampling.
That is done in this and the following chapter.

Probability is involved whenever the occurrence, or non-occurrence,
of any anticipated event is dependent to some degree upon chance.
An “event” can be any sort of occurrence or non-occurrence which
has been specified in advance. In the classroom, red and green
marbles might be placed in a sack, thoroughly mixed, and one drawn
out without looking into the sack. The drawing of a green marble
could be considered as the event E in this instance. Likewise, if a
bridge deck is thoroughly shuffled and one card drawn at random
from it, the appearance of the ace of spades then might be the event
E.

Another wide application of probability in everyday life lies in
the determination of the premiums for life insurance policies and
annuities. If a man aged 35 years purchases an annuity which will
pay him $100 per month starting at age 60 if he is alive, there are
three major matters to be considered: (a) interest on the money
involved, (b) the probabilities that the man will live to receive each
successive payment, and (c) operating expenses and a fair profit for
the company. Whether or not such a person does live to receive a
particular payment must be regarded as a chance event and, therefore,
requires some use of the theory of probability. Public opinion
polls regarding political matters, buyers' preferences, and foreign
affairs involve chance in the selection of the persons who are to be
interviewed. The reader should be able to think of many other
everyday events in which the theory of probability is involved.


3.1 THE DETERMINATION OF PROBABILITIES


Before a method is presented for determining the probability that
an event E will occur under specified conditions it is useful to
distinguish between what will be called single events and classes of
events. For the purposes of this book this distinction can be made
by means of examples. Suppose that two dice are placed in a can,
shaken vigorously, and rolled out upon a flat, hard surface. Many
"events" can occur with each die, but just six usually are of interest:
a 1, 2, 3, 4, 5, or a 6 appears on the upper face of each die when it
stops rolling. How the dice were turned when they were thrown,
where on the surface they came to rest, or how many turns they
made while in motion are ordinarily of no interest. Moreover, it
would be at least impracticable, if not impossible, to relate those
phenomena to the number of dots on the upper face of a die. Hence
the six possible events which will be considered herein are the
appearance of a 1, 2, 3, 4, 5, or a 6 on the upper face of each die.
Since these events cannot be further decomposed, we shall refer to
them as single events. If, with each die, these faces tend to appear
with equal relative frequencies over many trials, the dice are each
said to be unbiased. It is with single events occurring with equal
relative frequencies that we shall be primarily concerned in the
subsequent discussion. If both dice are considered simultaneously and
an event is considered to consist of a number on one die and a number
on the other die, thirty-six single events are possible because any
one face on the first die can appear with any of the six faces on the
other die. Each possible pair of faces defines an observable event.

If attention is turned to the sum of the numbers of dots appearing
on the upper faces of two dice which have been thrown simultaneously,
any one of eleven different sums is possible. The different
possible sums define eleven classes of single events (occurring with
equal relative frequencies). For example, the class of events
(composed of such single events), “sum = 7,” contains the following
single events:

1 on die 1, 6 on die 2;    6 on die 1, 1 on die 2;
2 on die 1, 5 on die 2;    5 on die 1, 2 on die 2;
3 on die 1, 4 on die 2;    4 on die 1, 3 on die 2.

The class, “sum = 2,” includes but one single event because there is
but one way that it is possible to get a sum of 2. The class, “sum =
3,” includes two single events: a 1 on die 1, a 2 on die 2; or a 2 on
die 1, a 1 on die 2. The class, “sum = 4,” includes three single
events, etc., until all thirty-six of the possible single events have
been put into one of the eleven classes of events.

We could define other classes of events among the thirty-six single
events possible when two unbiased dice are tossed. For example,
we could have class 1 = “sum = 7” and class 2 = “sum is not = 7.”
There are six single events in class 1 and thirty single events in class 2.

The preceding discussion has brought out the fact that single events 
and classes of events differ in one important respect. The single 
events are expected to occur with equal relative frequencies over 
many trials under the specified conditions, whereas the classes of 
events consist of groupings of single events, and hence would be ex- 
pected to occur with relative frequencies which depend upon the 
numbers of single events in the classes. 
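
The grouping of single events into classes can be made concrete by listing all thirty-six pairs of faces and sorting them by their sum. The sketch below is in Python and is only illustrative; the names single_events and classes_by_sum are our own.

```python
from collections import defaultdict
from itertools import product

faces = range(1, 7)

# Each single event is an ordered pair (face on die 1, face on die 2).
single_events = list(product(faces, faces))

# Group the single events into classes according to the sum of the faces.
classes_by_sum = defaultdict(list)
for die_1, die_2 in single_events:
    classes_by_sum[die_1 + die_2].append((die_1, die_2))

for total in sorted(classes_by_sum):
    members = classes_by_sum[total]
    print(f"sum = {total:2d}: {len(members)} single events {members}")

# The two-class grouping mentioned in the text: "sum = 7" and "sum is not = 7".
in_class_1 = len(classes_by_sum[7])
in_class_2 = len(single_events) - in_class_1
print(in_class_1, in_class_2)
```

The classes contain different numbers of single events, which is why the classes would not be expected to occur with equal relative frequencies.
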

Upon the basis of the preceding discussion, a useful method for
determining probabilities can be devised for instances in which the
single events occur with equal relative frequencies. Suppose that
under certain specified conditions any one of N possible single events
can occur and that they form an exhaustive set; that is, some one of
these single events must occur on any trial under the specified
conditions. Assume also that the single events are grouped into s
non-overlapping classes of events, with n_1 in class 1, n_2 in class 2,
..., and with n_s in class s. Then the probability that the single
event which actually does occur on one future trial will belong to
class i (i varies from 1 to s) is given by

(3.11)    P(E_i) = n_i/N.

As an illustration of the use of formula 3.11 consider the dice
problem discussed above in which thirty-six single events are possible.
Certain classes of events, the single events which each class includes,
and the probabilities associated with each class of events are given
in Table 3.11.

TABLE 3.11

SINGLE EVENTS AND CLASSES OF EVENTS INVOLVED WHEN TWO UNBIASED
DICE ARE THROWN, AND THE PROBABILITIES ASSOCIATED WITH THOSE CLASSES
OF EVENTS

Classes of     Single Events      Number of         Probabilities
Events         Die 1    Die 2     Single Events     for Classes
Sum = 2          1        1       1 (= n_1)         1/36
Sum = 3          1        2       2 (= n_2)         2/36
                 2        1
Sum = 4          1        3       3 (= n_3)         3/36
                 3        1
                 2        2
Sum = 5          1        4       4 (= n_4)         4/36
                 4        1
                 2        3
                 3        2
Sum = 6          1        5       5 (= n_5)         5/36
                 5        1
                 2        4
                 4        2
                 3        3
Sum = 7          1        6       6 (= n_6)         6/36
                 6        1
                 2        5
                 5        2
                 3        4
                 4        3
Sum = 8          2        6       5 (= n_7)         5/36
                 6        2
                 3        5
                 5        3
                 4        4
Sum = 9          3        6       4 (= n_8)         4/36
                 6        3
                 4        5
                 5        4
Sum = 10         4        6       3 (= n_9)         3/36
                 6        4
                 5        5
Sum = 11         5        6       2 (= n_10)        2/36
                 6        5
Sum = 12         6        6       1 (= n_11)        1/36
                                  Σ n_i = N = 36

Other classes of single events could be defined, of course, such as
the two classes: “sum ≤ 5” and “sum > 5.” From Table 3.11 it is
apparent that 10 of the 36 possible single events produce sums which
are less than or equal to 5, whereas the remaining twenty-six single
events yield sums which are greater than 5. Therefore, in this case,
N = 36, s = 2, n_1 = 10, and n_2 = 26; so that P(sum > 5) = n_2/N
= 26/36 = 13/18, or about .72.
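
Formula 3.11 amounts to counting the single events in a class and dividing by N. As a rough check on the two-class example, the following Python sketch counts the events directly; the helper class_probability is a name of our own choosing, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

# The N = 36 equally likely single events for two unbiased dice.
single_events = list(product(range(1, 7), repeat=2))
N = len(single_events)


def class_probability(belongs_to_class) -> Fraction:
    """Apply formula 3.11: P(E_i) = n_i / N."""
    n_i = sum(1 for event in single_events if belongs_to_class(event))
    return Fraction(n_i, N)


p_at_most_5 = class_probability(lambda event: sum(event) <= 5)
p_more_than_5 = class_probability(lambda event: sum(event) > 5)

print("P(sum <= 5) =", p_at_most_5)
print("P(sum > 5)  =", p_more_than_5, "or about", round(float(p_more_than_5), 2))
print("Total       =", p_at_most_5 + p_more_than_5)
```

Because the two classes are non-overlapping and together exhaust the thirty-six single events, their probabilities add to 1.
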

Two useful facts are derivable from formula 3.11: 


(3.12)    0 ≤ P(E_i) ≤ 1 because no n_i can be larger than N; and,
(3.13)    P(E_i) + P(not E_i) = 1 because n_i/N + (N - n_i)/N
          = N/N = 1.


Other laws follow from formula 3.11. Two of the more important
theorems will be proved and illustrated. Suppose that E_1 and E_2
denote two mutually exclusive classes of events; that is, single events
in classes E_1 and E_2 cannot occur simultaneously on any one trial.
Suppose also that there are n_1 and n_2 single events in classes E_1
and E_2, respectively. If a total of N single events is possible, the
probability that an event in either class E_1 or class E_2 will occur
on one random trial is, by definition,

(3.14)    P(E_1 or E_2) = (n_1 + n_2)/N = n_1/N + n_2/N
          = P(E_1) + P(E_2).

The same reasoning and algebra are sufficient to show that for r
classes of events: E_1, E_2, ..., E_r with n_i single events in class
E_i (i = 1 to r), the probability that some one of the mutually
exclusive events E_1, E_2, ..., E_r will occur on one random trial is
given by

(3.15)    P(E_1, E_2, ..., or E_r) = P(E_1) + P(E_2) + ... + P(E_r).

This result is known as the Law of Total Probability for Mutually
Exclusive Events.

To illustrate formula 3.15, suppose that a sack contains 10 green,
15 red, 5 white, and 20 purple marbles, all identical save for color.
What is the probability that a colored marble (that is, one which is
not white) will be drawn on one future random trial? Let E_1 stand for
the drawing of any one of the green marbles. There are 10 green
marbles, and each is equally likely to be drawn; hence there are 10
single events in the class E_1. Also, let E_2 represent the drawing of
any red marble, E_3 stand for the drawing of a white marble, and E_4
equal the drawing of any purple marble. Then n_1 = 10, n_2 = 15,
n_3 = 5, and n_4 = 20; hence

P(E_1, E_2, or E_4) = (10 + 15 + 20)/50 = 10/50 + 15/50 + 20/50
= .90.
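
The marble illustration can be verified in two ways: by adding the probabilities of the three mutually exclusive classes, as formula 3.15 directs, and by counting the colored marbles directly. The Python sketch below does both; the dictionary marbles and the other names are ours.

```python
from fractions import Fraction

# Contents of the sack as described in the text.
marbles = {"green": 10, "red": 15, "white": 5, "purple": 20}
N = sum(marbles.values())

# Probability of each mutually exclusive class by formula 3.11.
p = {color: Fraction(count, N) for color, count in marbles.items()}

# Law of Total Probability (3.15): add the probabilities of the classes.
p_colored = p["green"] + p["red"] + p["purple"]

# Direct count of the single events which are not white.
p_direct = Fraction(N - marbles["white"], N)

print("By formula 3.15:", p_colored, "=", float(p_colored))
print("By direct count:", p_direct)
print("P(white) =", p["white"], "so P(colored) + P(white) =", p_colored + p["white"])
```

The last line also illustrates formula 3.13, since "white" is the same class as "not colored."
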

In the discussion leading to the Law of Total Probability the events
E_1, E_2, ..., E_r were assumed to be mutually exclusive, that is, only
one of those events could occur on any one random trial. Suppose
now that E_1 and E_2 are independent events in the sense that each can
occur simultaneously on one trial without interfering or helping with
the occurrence of the other in any way. For example, the obtaining
of a 6 on one die and a 5 on another die on a single throw of the pair
of dice is an illustration of independent events. If E_1 and E_2 are
independent events, they can occur together in n_1·n_2 combinations of
single events because each of the n_1 single events in E_1 can occur
with each of the n_2 single events in E_2. The classes of events, E_1
and E_2, will each belong to a general class of events, which will be
supposed to contain N_1 and N_2 single events, respectively. Therefore,
the total number of combinations of single events possible on
random trials now is N_1·N_2. Of those possible single events, n_1·n_2
will belong to both E_1 and E_2. Therefore, the probability that an
event in E_1 will occur simultaneously with an event in E_2 is given by

P(E_1 and E_2) = (n_1·n_2)/(N_1·N_2) = (n_1/N_1)·(n_2/N_2)
= P(E_1)·P(E_2).


As an illustration of the above discussion and results, suppose that
a game consists in throwing a penny and an unbiased die simultaneously,
with the thrower winning if he throws a head on the
penny along with a 5 or a 6 on the die. Let E_1 represent throwing a
head on the coin, and E_2 throwing a 5 or a 6 on the die. A single
event now consists of a particular result on the coin plus a particular
result of throwing the die. The coin can turn up either of two ways,
the die any of six ways; hence there are 2(6) = 12 combinations of
events, each equally likely to occur on any one trial. In these
circumstances, n_1 = 1, n_2 = 2, N_1 = 2, and N_2 = 6; hence

P(H and a 5 or a 6) = (1)(2)/[(2)(6)] = (1/2)(2/6) = 1/6.

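
The penny-and-die game is small enough that all twelve single events can be listed, so the product rule can be compared with a direct count. The following Python sketch is offered only as a check on the arithmetic; the variable names are ours.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = range(1, 7)

# Each single event pairs one result on the coin with one on the die.
single_events = list(product(coin, die))

# The thrower wins on a head together with a 5 or a 6.
winning_events = [(c, d) for c, d in single_events if c == "H" and d in (5, 6)]

p_by_counting = Fraction(len(winning_events), len(single_events))
p_by_product = Fraction(1, 2) * Fraction(2, 6)   # P(E_1) times P(E_2)

print("Single events:", len(single_events))
print("Winning events:", winning_events)
print("By counting:", p_by_counting, " By product rule:", p_by_product)
```
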
The reasoning and algebra above can be extended easily to prove
that, if the occurrence of events in classes E_1, E_2, ..., and E_r are
independent and can occur in n_i out of N_i ways, respectively (i = 1
to r), the probability of the simultaneous occurrence of these r events
on one future random trial can be obtained from the following
formula:

(3.16)    P(E_1, E_2, ..., and E_r) = P(E_1)·P(E_2) ··· P(E_r).

This result is known as the Law of Compound Probability for
Independent Events.

A similar and more general law than (3.16) can be established for
situations involving dependent, rather than independent, events.
Suppose that the occurrence of event E_2 depends on another event
E_1 having occurred previously, or that it is useful to regard E_2's
occurrence as depending on the prior happening of E_1, perhaps for the
sake of convenient computation. For example, suppose that a bridge
deck is to be well shuffled, and then two cards drawn successively
and at random without replacing the first card drawn. What is the
probability that the second card to be drawn will be an ace? It
should be apparent that the answer depends somehow on the outcome
of the first draw from the deck so that the second event is
dependent upon the first event.


To attack the problem rather generally, suppose that n single
events are possible under a given set of circumstances and that an
event E_1 is associated with n_1 of these n single events. Assume also
that an event E_2 occurs on n_12 of the events on which E_1 also occurs.
Then the probability that both E_1 and E_2 will occur on one trial is
P(E_1 E_2) = n_12/n, which can be rewritten in the following way:

(3.17)    P(E_1 E_2) = n_12/n = (n_1/n)·(n_12/n_1) = P(E_1)·P(E_2/E_1),

where P(E_2/E_1) is the probability that E_2 will occur after it is
known that E_1 has occurred.

The law expressed in (3.17) actually includes the law
of (3.16) as a special case. If E_2 is independent of E_1, the number
of single events on which E_2 can occur will be the same regardless
of the prior occurrence or the non-occurrence of E_1; hence the
probability P(E_2/E_1) will be just P(E_2), and the formula 3.17
becomes (3.16).


Problem 3.11. What is the probability that two successive aces will be
drawn from a well-shuffled bridge deck if the first card drawn is not
replaced before the second draw is made?

The number of aces available for the first draw is 4, and any of
52 cards might be drawn; hence P(E_1) = 4/52. On the second draw,
assuming that an ace was drawn on the first draw, there are 3
aces among 51 cards remaining in the deck. Hence, P(E_2/E_1) =
3/51; then, by formula 3.17, P(A, A) = (4/52)·(3/51) = 1/221.

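
The answer to Problem 3.11 can be confirmed by treating every ordered pair of distinct cards as a single event, which is feasible because there are only 52 × 51 such pairs. The Python sketch below does this and compares the count with formula 3.17; the representation of the deck is our own.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Every ordered pair of different cards is one equally likely single event
# when two cards are drawn without replacement.
ordered_draws = list(permutations(deck, 2))

both_aces = sum(
    1 for first, second in ordered_draws if first[0] == "A" and second[0] == "A"
)

p_by_counting = Fraction(both_aces, len(ordered_draws))
p_by_formula = Fraction(4, 52) * Fraction(3, 51)   # P(E_1) times P(E_2/E_1)

print("Ordered draws:", len(ordered_draws), " with two aces:", both_aces)
print("By counting:", p_by_counting, " By formula 3.17:", p_by_formula)
```
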
There are many situations in which those chance occurrences 
which would be considered as the single events do not occur with 
equal relative frequencies. For example, a coin may be biased so 
that the heads side turns up more frequently than the tails side. Un- 
der such conditions, we cannot assign a probability of 1/2 to the 
occurrence of heads (and likewise for tails) and employ the simple 
arguments used above. However, we can think of determining the 
appropriate probabilities for these single events by empirical means, 
that is, by many actual trials under the specified conditions. For 
example, we could toss the coin in question many times and then use 
the observed proportion of heads as an approximation to the true 
probability, p. Thereafter, formulas 3.15, 3.16, and 3.17 can be used. 
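
The empirical approach can be imitated on a computer. In the sketch below a coin is assumed, purely for illustration, to turn up heads with true probability 0.6; the function estimate_heads_probability, the number of tosses, and the seed are all choices of our own. The observed proportion of heads serves as the approximation to p, after which formula 3.16 is applied to two independent tosses.

```python
import random


def estimate_heads_probability(true_p: float, tosses: int, seed: int) -> float:
    """Toss a simulated biased coin and return the observed proportion of heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(tosses) if rng.random() < true_p)
    return heads / tosses


# Hypothetical biased coin: heads is assumed to have true probability 0.6.
p_estimate = estimate_heads_probability(true_p=0.6, tosses=100_000, seed=1)

# With the empirical value in hand, formula 3.16 gives the approximate
# probability of heads on each of two independent tosses.
p_two_heads = p_estimate * p_estimate

print(f"Observed proportion of heads: {p_estimate:.4f}")
print(f"Approximate P(two heads):     {p_two_heads:.4f}")
```

With fewer tosses the observed proportion would be expected to wander farther from the true value, which is why many trials are called for.
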

An interesting and instructive application of the probability methods
introduced above can be made to the study of human blood
groups. If the red blood corpuscles of one individual’s blood are
mixed with the blood serum of another person (as in transfusions),
one of two general results will be observed to follow: the red
corpuscles will disperse evenly through the recipient's blood as though
in their own serum, or they will form clumps of cells. The latter
reaction is called agglutination, and it is so undesirable that there is 
considerable interest in preventing its occurrence. To that end, 
bloods are classified according to certain systems. One such system 
is based on the known existence of factors A and B each (or both) 
of which may be either present or absent from any person’s blood. 
The following four blood groups are based on the A and B factors: 


(1) type O: neither A nor B present in the blood; 
(2) type A: factor A present but not factor B; 
(3) type B: factor B present but not factor A; 


(4) type AB: both of the factors A and B are present in the blood. 


There are several interesting features about the A and B factors
in blood. (1) They are inherited essentially in accord with simple
Mendelian laws of inheritance, a circumstance which requires measures
of probability. (2) Various racial or geographic groups tend to
differ from each other in the proportions carrying the A and/or B
factors, thus providing a source of some additional evidence about
racial origins. (3) The mode of inheritance of the A-B groups can
be used in genetic studies and in some legal problems.

Each person is considered to have in the cells of his body twenty-three
pairs of chromosomes, one member of each pair having come
from each parent. On these chromosomes are carried genes which
are believed to govern the inheritance of various human characteristics.
The genes which determine the presence or the absence of the
A and B factors in the blood are carried on one of the twenty-three
pairs of chromosomes. Attention here is centered solely on the
chromosomes of that pair, one of which came from the father, the other
from the mother. Moreover, attention is to be fixed upon one specific
gene position on each such chromosome, namely, that position
occupied by the genes which cause the presence or the absence of the
A and B factors. If the gene at this position produces neither the
A nor the B factor in the blood, it is marked (diagrammatically)
as an O gene. Similarly there are A and B genes so that the blood
type can be indicated by showing what genes the two chromosomes
carry. Symbolically, there are the following four blood types:


O/O = type O; A/O or A/A = type A; B/O or B/B = type B; and
A/B = type AB.


The information presented in the preceding paragraphs makes it
possible to predict the proportions of the various blood groups among
the progeny of any particular combination of parents, provided that
a large number of such parents and children are involved. Suppose
that one parent has blood of type AB and the other has type O blood.
Then the possible blood types which can occur among their offspring
are as follows:

                       Father         Mother
Parents                 A/B     ×      O/O
Genes passed on        A or B         O or O
Possible offspring     A/O, A/O, B/O, B/O

Of the four possible pairings of chromosomes from the father and
the mother, two produce type B blood in the child because only the
gene for the factor B is carried on the chromosomes. Similarly, the
other two possible pairings of chromosomes produce type A blood in
the children. There is no reason to doubt the usual hypothesis that
each of these four possible pairings occurs the same percentage of the
time in the long run of many such matings; hence the individual
pairings are considered to be single events. It follows that the
probability that any specified future child of these parents will have
type B blood is P(B) = 2/4 = 1/2. Similarly, P(A) = 1/2. There is no
possibility that these parents will produce a child of either type O
or type AB; hence P(AB) = P(O) = 0/4 = 0.
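
The four pairings can also be listed mechanically. The Python sketch below treats each parent as a pair of genes and each pairing of one gene from the father with one from the mother as a single event; the functions blood_type and offspring_probabilities are illustrative names of our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def blood_type(gene_pair):
    """Translate a pair of genes (each 'A', 'B', or 'O') into a blood type."""
    genes = set(gene_pair)
    if genes == {"O"}:
        return "O"
    if genes == {"A", "B"}:
        return "AB"
    return "A" if "A" in genes else "B"


def offspring_probabilities(father_genes, mother_genes):
    """Count the four equally likely pairings and apply formula 3.11."""
    pairings = list(product(father_genes, mother_genes))
    counts = Counter(blood_type(pairing) for pairing in pairings)
    return {t: Fraction(n, len(pairings)) for t, n in counts.items()}


# Father A/B, mother O/O, as in the text.
print(offspring_probabilities(("A", "B"), ("O", "O")))

# Father A/O with mother O/O, and father A/A with mother O/O.
print(offspring_probabilities(("A", "O"), ("O", "O")))
print(offspring_probabilities(("A", "A"), ("O", "O")))
```

Types which do not appear in the output, such as O and AB for the first mating, have probability zero.
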

In view of the fact that the inheritance which a child receives from
its father is an independent event with respect to its inheritance
from its mother, formula 3.16 can be applied. This gives the following
simple solution for the probability that a child with blood type
B will be produced:

P(B from father) = 1/2; P(O from mother) = 1; therefore, P(B
from father and O from mother) = (1/2)(1) = 1/2.

Since that is the only way a child with B-type blood can be produced
by these parents, that is the solution to the problem.

If one parent is type O and the other type A, something must be
known or assumed regarding the specific type A, that is, A/O or A/A.
If one parent is A/O and the other is type O/O, P(A) = P(O) = 1/2.
No other type is possible. But if the parent with type A blood is
actually A/A, all children will be A/O.

If it is known only that the parents are of types A and O, respectively,
and if it is assumed that type A is equally frequently A/O and
A/A, the probability that any particular future child will be type
A is

P(type A child) = P(A/O parent)·P[A child/(A/O parent)]
                  + P(A/A parent)·P[A child/(A/A parent)]
                = (1/2)·(1/2) + (1/2)·(1) = 3/4

by the probability laws of (3.15) and (3.17).

Table 3.12 was derived by the above methods under the assumption
that type A is equally frequently A/A and A/O; and similarly
for type B. (Actually this assumption is unrealistic, but it is
convenient here.) The reader should verify several of the probabilities
in this table, noting particularly where the assumption regarding
the relative frequency of A/A and A/O among type A parents (or
likewise for type B) affects the calculations.

TABLE 3.12

TYPES AND PROPORTIONS OF OFFSPRING FROM THE INDICATED MATINGS

Mother's    Father's    Types of    Probability of That
Type        Type        Progeny     Type of Progeny
O           O           O           1
O           A           O           1/4
                        A           3/4
O           B           O           1/4
                        B           3/4
O           AB          A           1/2
                        B           1/2
...

Another method of classifying human bloods is based on the M
and N factors, which are inherited independently of the A and B
factors; that is, the genes for M and N are on a different pair of
chromosomes from that which carries the gene for A and B. Apparently,
M and N are never both absent. There are, therefore, just
three types: M, N, and MN, if we ignore the subtypes which have
been discovered recently. The following symbolism will be employed
in the discussion of the M-N blood types:

M/M = type M, N/N = type N, and M/N = type MN.

The inheritance of these types can be studied in the manner already
established for the A-B blood groups.

In view of the fact that the three M-N types are classifications
which are independent of a person’s A-B type, the two blood groupings
considered simultaneously make it possible to distinguish
3 × 4 = 12 different blood types even without bothering with the
subdivisions of the A-B and M-N groups, which are serologically
determinable.


Problem 3.12. If a woman's blood belongs to types O and MN, and her
husband’s blood is AB and N, what are the possible blood types for
their first child, and what is the probability associated with each
type?

The mother can pass on to her child one of the pairs of genes
O, M or O, N because her genetic constitution as regards blood types
is (O/O)(M/N). Likewise, the father can transmit either A, N or
B, N. Hence the possible gene combinations in the child are:
O, M with A, N; O, M with B, N; O, N with A, N; and O, N with B, N.
These produce the following four classifications of blood:
A-MN, B-MN, A-N, and B-N, which occur equally frequently. Hence, each
of these types has a probability of one-fourth of occurring. No
other type can occur.

Two other blood groups will be mentioned: one involves the Rh factor,
the other is based on the P factor. Each is inherited independently
of the other and of the A-B and M-N types. Recent ... and Rh/Rh and
Rh/rh are ...

The information given previously on blood groups can be summarized
as follows:

BLOOD TYPING SYSTEM

I (A-B)                 II (M-N)          III (P)                 IV (Rh)
O/O = group O           M/M = type M      P/P or P/p = type P+    Rh/Rh or Rh/rh = type Rh+
A/O or A/A = group A    N/N = type N      p/p = type P−           rh/rh = type Rh−
B/O or B/B = group B    M/N = type MN
A/B = group AB

It is seen that there are 4 × 3 × 2 × 2 = 48 mutually exclusive
and serologically distinguishable blood classifications, even with the
simplified groups discussed herein. By using all the known subgroups
of blood types, there are many more distinguishable and mutually
exclusive classifications of human bloods. The availability of
these classifications has been helpful in legal cases involving
disputed parentage, heirship claims when alleged maternity is doubted,
identification of blood stains, genetic studies, anthropological
investigations, and the identification of corpses when other methods
have failed.
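The count of 48 can be checked by direct enumeration. The following is a minimal sketch in Python, which is not part of the original text; the list names are illustrative assumptions, and the labels are simply the type names from the summary above.

```python
from itertools import product

# Serologically distinguishable types in each of the four simplified systems.
ab_groups = ["O", "A", "B", "AB"]
mn_types = ["M", "N", "MN"]
p_types = ["P+", "P-"]
rh_types = ["Rh+", "Rh-"]

# Every classification takes one type from each system, so the
# classifications are the Cartesian product of the four lists.
classifications = list(product(ab_groups, mn_types, p_types, rh_types))
n_classifications = len(classifications)

print(n_classifications)
```

Because the four systems are treated as independent, the number of classifications is just the product of the four list lengths.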

Apparently the chief use of blood types in legal cases occurs when
one can prove the impossibility of an allegation. For example, an
O-type father and an AB-type mother cannot (under the information
set forth above) have an O-type child. Or, as another case, if
an accused person has blood stains on his clothing and claims that
they resulted from his having had a nosebleed, the finding that his
blood is A, M whereas the stains are A, N would disprove his claim.


PROBLEMS 
1. What is the probability that a wife with type A blood and a husband with
type B blood will have three children whose blood types all are O?

2. Suppose that a husband has type A blood and that his wife's blood group
is AB. What is the most likely type of blood for their child under the
assumptions made in Table 3.12? What type is impossible? Ans. A, O.

3. If the parents of problem 2 claim five children all of blood type B, would
you doubt the blood types or, perhaps, the alleged parentage? Give probability
argument.
4. If a name is to be selected at random from among those persons who were
residents of the United States in 1950 and who then were between the ages of
35 and 74, inclusive, estimate the probability that the person so chosen will be
aged 50 to 59, inclusive. (See problem 8, section 2.5.) Ans. .26.

5. Compute the probability of throwing either a sum of seven or of eleven 
on one throw of two unbiased dice by enumerating the single events in these two 
classes of events. Verify your answer by applying the Law of Total Probability. 

6. If five unbiased coins are to be flipped simultaneously, calculate the proba- 
bility that there will be a 3:2 division of heads and tails, either way. Ans. 5/8. 

7. Verify the probabilities given in the second and tenth lines of Table 3.12
by listing all the possible combinations of chromosomes. Where does the matter
of single events come into these calculations?

8. Use the laws of Total and Compound Probabilities to solve problem 7.

9. Suppose that two bags, identical in appearance, contain, respectively, 20
red and 30 blue marbles; and 40 red and 10 blue marbles. If one bag is to be
selected at random and then one marble withdrawn from that bag, what is the
probability that it will be red? That it will not be red?

10. If three unbiased dice are to be thrown once, what is the probability that
a sum of 4 will be thrown? A sum of at least 4? Ans. 1/72, 215/216.

11. If the throw described in problem 10 is to be made twice, what is the
probability that a sum of 4 will be thrown both times? What is the probability
that exactly one sum of 4 will be thrown on the two throws?

12. Suppose that two babies h
tain hospital, and that one of t
were interchanged either willfull
babies and of the parents are


Mr. Timoféef is A, MN, P+, and Rh+;
Mrs. Timoféef is B, N, P+, and Rh-;
Mr. Brown is B, M, P+, and Rh-;
Mrs. Brown is O, N, P-, and Rh-.


The child the Timoféefs now have is O, MN, P-, and Rh+. The child the
Browns now have is O, MN, P+, and Rh-. Have the babies been interchanged?
Or is it impossible to tell from this information? Give reasons.


Ans. No interchange has occurred. 
hy man has died a woman claims 
the deceased was the father. Also 
acts about blood classes have been established: 


(1) The deceased's blood was B, M, Rh+, and P+.
(2) The deceased had a son whose blood was in group O and also was Rh-.
(3) The alleged mother's blood is A, MN, Rh+, and P+.
(4) The girl's blood is O, M, Rh-, and P


about the paternity of the girl? Justify your 
nce based on the following assumptions: (1) 
to change this assumption. (2) Similarly, for type A. (3) The probability that
a person who tests Rh+ is specifically Rh/rh is 2/3, and similarly for P+ if 
there is no other information available which would change the probability. 

14. Suppose that a tortoise (land turtle) is wandering at random on a 50 by 
50-foot lawn enclosed by a fence. He is equally likely to be on any particular
square foot of lawn one could designate in advance. What is the probability 
that at any specified future time he will be within 10 feet of the fence? If it is 
known that he is not within 10 feet of the south fence, what is the probability 
that he is not within 10 feet of any of the four fences? Ans. 16/25, 9/20. 

15. Ignoring the refinements, the Rh factor is inherited as described above. 
The discovery of this factor in 1940 led to an explanation of one type of infant 
mortality, erythroblastosis. In a large majority of the cases, the father is Rh+, 
the mother is Rh—, and the child is Rh+. Only a fraction of the cases wherein 
the child is Rh+ and the mother is Rh—, which are potentially erythroblastotic, 
actually result in trouble; but why some do and others do not is not presently 
known. Obviously, the father could be either Rh/Rh or Rh/rh, but the mother 
must be rh/rh. Assume that the population of potential parents is divided for 
each sex as follows: 


30 per cent RhRh, 60 per cent Rhrh, and 10 per cent rhrh 


What is the expected proportion of potential erythroblastotics among their 
children? 


3.2 PERMUTATIONS AND COMBINATIONS


Probability has been calculated in such a way that two numbers
need to be determined: (a) the number of single events in the class
of events whose probability of occurrence is being determined, and
(b) the total number of single events possible under the prescribed
conditions or in the mathematical model. For example, the probability
of throwing a sum of seven with two unbiased dice is the
ratio of the number of single events which give a seven to the total
number of ways a sum can be produced. In this instance it is easy
to determine those two numbers, but it is not usually easy. The
determination of the necessary two numbers often is greatly facilitated
by the use of the mathematical concepts, permutations and
combinations. In the process of introducing these concepts, it is
convenient to develop certain useful formulas in terms of abstract
letters. Thereafter, these symbols will be employed to represent
persons, heads and tails on a coin, physical objects, etc.

A set of letters, such as ABC, can differ from another set of the
same number of similar marks in one, or both, of two ways: the same
letters may appear in a different order, or exactly the same letters
may not be used in both sets. For example, ABC and ACB are
different orderings of the same three letters, whereas ABC and BCD
involve different letters. Two sets of three (or of n) letters are said
to be different permutations of letters if they differ in either the order
in which the letters are arranged, or if some different letters are
involved in the two sets. Two groups of n letters are considered to be
different combinations of letters only if some different letters are
included in the two sets. A reordering of the same letters forms a
new permutation but not a new combination. Hence, just one
combination can be formed from n given letters if all the letters are used
at once. This question then arises: How many different permutations
can one form from n letters, using all n letters each time?

The process of constructing a permutation consists in determining
a first, a second, ..., and finally an nth letter. The first letter is
chosen from among n, the second from among the remaining $(n-1)$
letters, the third from among the $(n-2)$ then remaining unchosen,
etc., until finally only one letter is left for the nth choice. These n
choices can be made in $n(n-1)(n-2)(n-3)\cdots(2)(1)$ ways,
which is then the number of different permutations possible with n
letters if all n of them are used in each permutation. To illustrate,
suppose that there are three letters: A, B, and C. The following
outline shows how the choices can be made:


For first letter,    A,        or B,       or C.
For second letter,   B or C    A or C      A or B.
For third letter,    C or B    C or A      B or A.


Therefore the permutations are: ABC, ACB, BAC, BCA, CAB, CBA,
and there are 3(2)(1) = 6 of them.
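The listing of the six permutations can be reproduced mechanically. The sketch below is an illustration in Python added to the text, not the author's procedure; it relies on the standard library function `itertools.permutations`, and the variable names are assumptions made for the example.

```python
from itertools import permutations
from math import factorial

letters = "ABC"

# Each permutation is a tuple of letters; join it back into a string.
perms = ["".join(p) for p in permutations(letters)]

print(perms)
print(len(perms) == factorial(len(letters)))
```

The same call with a longer string shows how quickly the number of orderings grows as n increases.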

It is convenient to denote the product $n(n-1)(n-2)\cdots(2)(1)$
by the symbol $n!$, and to call it n factorial. Hence, if $P_{n,n}$ is adopted
as the symbol for the number of permutations of n marks arranged in
sets of n, we have


(3.21)    $P_{n,n} = n!$


as the formula for computing the number of such permutations. If
n = 3, as above, $P_{3,3} = 3! = 3\cdot2\cdot1 = 6$, as before.

More often, it is necessary to make up permutations of marks in
which only r of the n marks are used at any one time. For example,
we might be choosing a batting order of 9 men from a squad of more
than 9 men. To see how the process goes, suppose that it is required
to make up all the possible two-letter permutations from the letters
A, B, C, and D. There are 4 choices for the first letter and 3 choices
for the second; therefore, there are 4(3) = 12 possible choices of two
from among four. In general, the symbol $P_{n,r}$ stands for the number
of r-letter permutations which can be formed from n letters.
Then, for n = 4 and r = 2, $P_{4,2} = 4(3) = 12$, or, in general,


(3.22)    $P_{n,r} = n(n-1)(n-2)\cdots(n-r+1)$.


It is possible and useful to express $P_{n,r}$ in terms of factorials. To
do so, we deliberately create the factorials from formula 3.22 by
multiplying and dividing by $(n-r)(n-r-1)\cdots(2)(1)$ to make
the numerator into $n!$ and the denominator into $(n-r)!$. The final
result is


(3.23)    $P_{n,r} = \dfrac{n!}{(n-r)!}$.
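Formulas 3.22 and 3.23 should always give the same number. The following Python sketch, added here for illustration, computes both forms side by side; the function names are hypothetical, and `math.perm` (available in Python 3.8 and later) serves only as an independent check.

```python
from math import factorial, perm

def permutations_by_product(n, r):
    """Formula 3.22: the product of the r factors n, n-1, ..., n-r+1."""
    result = 1
    for factor in range(n, n - r, -1):
        result *= factor
    return result

def permutations_by_factorial(n, r):
    """Formula 3.23: n! divided by (n-r)!."""
    return factorial(n) // factorial(n - r)

# The two-letter permutations of A, B, C, D discussed in the text.
print(permutations_by_product(4, 2), permutations_by_factorial(4, 2), perm(4, 2))
```

The product form avoids building large factorials only to cancel them again, which matters little here but is the reason formula 3.22 is often the easier one for hand computation.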


From the definitions of permutations and combinations it follows
that every set of r letters can be formed into but one combination,
using all r of the letters; whereas, r letters can be formed into $r!$
permutations. Hence, it is concluded that there are $r!$ times as many
permutations of n marks taken r at a time as there are different
combinations of n letters taken r at a time. Therefore, if the symbol
$C_{n,r}$ is adopted to indicate the number of possible combinations of
n letters taken in groups of r letters each, the formula for that
number is whichever of


(3.24)    $C_{n,r} = \dfrac{n(n-1)(n-2)\cdots(n-r+1)}{r!}$


or


(3.25)    $C_{n,r} = \dfrac{n!}{r!\,(n-r)!}$


we wish to employ.
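The statement that there are r! times as many permutations as combinations can be checked by listing both. The sketch below is an added illustration in Python under the assumption that the four letters A, B, C, D are taken two at a time; `combinations_count` is a hypothetical helper implementing formula 3.25.

```python
from itertools import combinations, permutations
from math import factorial

def combinations_count(n, r):
    """Formula 3.25: n! / (r! (n-r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

letters = "ABCD"
r = 2

# Enumerate rather than compute, so the formulas are tested independently.
perm_count = len(list(permutations(letters, r)))
comb_count = len(list(combinations(letters, r)))

print(perm_count, comb_count)
print(perm_count == factorial(r) * comb_count)
print(comb_count == combinations_count(len(letters), r))
```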

Problem 3.21. In how many different orders can 4 cars be parked among 6 
consecutive parking places along a curb? 

It should be clear in this situation that the order in which the cars 
are parked makes a difference because the different orders are dis- 
tinguishable, and would be considered as different by a policeman 
checking parking. Therefore, this is a problem in numbers of permu- 
tations and can be worked by either formula 3.22 or 3.23. By 
formula 3.23 

$P_{6,4} = \dfrac{6!}{(6-4)!} = \dfrac{6\cdot5\cdot4\cdot3\cdot2\cdot1}{2\cdot1} = 360.$
Problem 3.22. How many different (as to cards held) 5-card poker hands are
possible from the usual 52-card deck?


In view of the fact that the order in which the cards were dealt
does not affect the actual cards held, this is a problem in numbers
of combinations of 52 objects taken 5 at a time; hence the number
of poker hands is $C_{52,5} = 52!/(5!\,47!) = 2{,}598{,}960$ after common
factors in numerator and denominator are divided out and the
remaining factors multiplied together.


Problem 3.23. What is the probability that 5 cards dealt from a well-shuffled
poker deck will all be spades?


Two numbers need to be determined before formula 3.11 can be
applied: (1) the total number of 5-card hands which are all spades,
and (2) the total number of 5-card hands of any sort which possibly
could be dealt from the deck. In view of the fact that the order in
which the cards were dealt is unimportant, this is a matter of finding
numbers of combinations, namely, $C_{13,5}$ and $C_{52,5}$. Therefore, the
required probability is


P(all spades) = $C_{13,5}/C_{52,5}$ = .0005, or about 1 chance in 2000.


Problem 3.24. What is the probability that 5 cards dealt from a well-shuffled
poker deck will include exactly 3 aces?


Three aces can be chosen from among the 4 available in $C_{4,3}$ or
4 ways. Likewise, $C_{48,2} = 1128$ is the number of different pairs of
cards which do not include any aces. All possible 5-card hands with
exactly 3 aces must necessarily be the same as all the possible ways
to put some 3 aces with one of the 1128 pairs of cards which are not
aces; hence there must be 4(1128) = 4512 different 5-card hands
which include exactly 3 aces. Therefore, the probability of being
dealt such a hand is


P(exactly 3 aces, 2 non-aces) = $4512/C_{52,5}$
                              = .0017, or about 1 chance in 576.
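The three poker computations above reduce to ratios of numbers of combinations, so they are easy to check by machine. The following is a sketch in Python added for illustration only; the variable names are assumptions, and exact fractions are used so that no rounding enters until the final display.

```python
from fractions import Fraction
from math import comb

# Problem 3.22: all 5-card hands from a 52-card deck.
total_hands = comb(52, 5)

# Problem 3.23: hands drawn entirely from the 13 spades.
p_all_spades = Fraction(comb(13, 5), total_hands)

# Problem 3.24: choose 3 of the 4 aces, then 2 of the 48 non-aces.
three_ace_hands = comb(4, 3) * comb(48, 2)
p_three_aces = Fraction(three_ace_hands, total_hands)

print(total_hands)
print(round(float(p_all_spades), 4), round(float(p_three_aces), 4))
```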


PROBLEMS 


1. In how many ways, which differ as regards the persons in particular chairs,
can 4 men and 4 women be seated around a dinner table, with men and women
seated alternately?

2. Suppose that there are 10 persons in a room, and that they have the following
blood types: 1 is AB, 3 are A, 2 are B, and 4 have type O blood. If 2
are chosen at random what is the probability that they will have the same type
of blood? Ans. 2/9.

3. Suppose that a baseball team has 4 men who can bat in any of the first 
3 positions, 5 who can bat in any of the fourth, fifth, and sixth positions, and 
7 who can bat in any of the last three positions. How many possible batting 
orders are there? 

4. Assume that 7 insecticides are to be tested as to their effectiveness in 
killing house flies. If each spray is to be tested against every other spray once 
in a separate test, how many tests will this require? Ans. 21. 

5. Suppose that a housewife buys 3 cans of peaches, 6 cans of apricots, and 4 
cans of pears; and suppose that her child tears off all the labels on the cans. 
If the housewife needs 2 cans of fruit for dinner, what is the probability that
the first 2 cans chosen will contain the same kind of fruit? 


6. How many 13-card bridge hands are there with no card higher than 8? 
Ans. 37,442,160. 


7. If 7 unbiased coins are flipped simultaneously, how many single events are
there in the class: 3 heads, 4 tails?

8. Compare the coefficients of $(x+y)^5$ with $C_{5,0}$, $C_{5,1}$, $C_{5,2}$, $C_{5,3}$, $C_{5,4}$,
and $C_{5,5}$, given that $0! = 1$.

9. What is the probability that 5 cards dealt from a well-shuffled poker deck
will include 3 queens and 2 aces? Three queens and at least 1 ace?

10. What is the probability that 13 cards dealt from a well-shuffled bridge
deck will include exactly 8 honor cards (honor cards are 10, J, Q, K, and Ace)?
Ans. .040.

11. In how many ways can 6 boxers be paired off for 3 bouts being held
simultaneously?


3.3 REPEATED TRIALS UNDER SPECIFIED
CONDITIONS


Situations involving the numbers of occurrences and non-occurrences
of an event E on repeated trials under the same original conditions
are of particular interest in statistical analysis. The principles
involved will be seen to be important to the study of frequency
distributions, and to sampling studies.

The probability problems created when trials are repeated under
fixed conditions can be illustrated by means of mathematical models
of these problems. Suppose that a coin is flipped n times and the
number of heads noted. On such a set of repeated trials any number
of heads is possible from 0 to n, that is, there are (n + 1) possible
classes of events: 0 heads, n tails; 1 head, $(n-1)$ tails; 2 heads,
$(n-2)$ tails; ... ; n heads, 0 tails. Each class of events includes
some number of single events (if the coin is unbiased) from 1 to
whatever the maximum size of $C_{n,r}$ is for the given n. For example,
if an unbiased coin is tossed 5 times, there are 6 classes of events, and
the specific single events in the class 4H, 1T are

                      Toss

First    Second    Third    Fourth    Fifth

  T        H         H        H         H
  H        T         H        H         H
  H        H         T        H         H
  H        H         H        T         H
  H        H         H        H         T
It is seen that $C_{5,4} = \dfrac{5!}{4!\,1!} = 5$ = the number of single events in the
class which includes exactly 4 heads. It also should be observed that
the outcome of each toss is an independent event relative to the
outcome of any other toss; hence the probability of the first result listed
above, THHHH, is $(1/2)(1/2)\cdots(1/2) = (1/2)^5$. With this unbiased
coin, that also is the probability for any of the other single
events in this class of events. Therefore, the probability of an event
in the class 4H, 1T is $C_{5,4}(1/2)^4(1/2)^1$, in which the exponent 4 refers
to the number of heads and the exponent 1 refers to the number of
tails. The reader can verify the fact that for any specified number of
heads from 0 to 5 the probability of exactly r heads is
$C_{5,r}(1/2)^r(1/2)^{5-r}$, where r takes any value from 0 to 5.

In general, if n unbiased coins are to be flipped (or one such coin be
flipped n times) the probability of the appearance of any specific
number of heads, say r, is


(3.31)    $P[r \text{ heads}, (n-r) \text{ tails}] = C_{n,r}(1/2)^r(1/2)^{n-r}$.


To extend this result a bit, let an event E have a constant probability,
p, of occurrence on each of n repeated trials. Then the
probability that E will occur on exactly r of the trials [and fail on
the other $(n-r)$ trials] is given by the following formula:


(3.32)    $P[r\ E\text{'s}, (n-r) \text{ not-}E\text{'s}] = C_{n,r}\,p^r(1-p)^{n-r}$.


The student can verify that this formula becomes (3.31) if p = 1/2,
E = H, and (not-E) = T.
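Formula 3.32 translates directly into a short function. The sketch below is an illustration in Python added to the text; the name `binomial_probability` is an assumption, and exact fractions are used so the results can be compared with the hand calculations above.

```python
from fractions import Fraction
from math import comb

def binomial_probability(n, r, p):
    """Formula 3.32: probability that E occurs exactly r times in n trials."""
    p = Fraction(p)
    return comb(n, r) * p**r * (1 - p) ** (n - r)

half = Fraction(1, 2)

# Probabilities of 0, 1, ..., 5 heads on 5 tosses of an unbiased coin.
five_toss = [binomial_probability(5, r, half) for r in range(6)]

print(five_toss[4])    # the class 4H, 1T
print(sum(five_toss))  # the six classes exhaust all possibilities
```

Setting p to 1/2 reproduces formula 3.31, while any other constant p gives the more general case.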

One more generalization can be obtained regarding formulas 3.31
and 3.32 by considering the expansions of the two binomials
$(1/2 + 1/2)^n$ and $(q + p)^n$, in which $q = 1 - p$. To see these
generalizations, consider the following binomial expansions:


$(1/2 + 1/2)^2 = 1(1/2)^0(1/2)^2 + 2(1/2)^1(1/2)^1 + 1(1/2)^2(1/2)^0$
$\qquad = C_{2,0}(1/2)^0(1/2)^2 + C_{2,1}(1/2)^1(1/2)^1 + C_{2,2}(1/2)^2(1/2)^0$
$\qquad = P(0H, 2T) + P(1H, 1T) + P(2H, 0T).$


That is, the successive terms of the expansion of $(1/2 + 1/2)^2$ are
given by formula 3.31 if r = 0, 1, and 2, successively; and those
three terms give the probabilities for the three possible classes of
events in terms of the number of heads appearing. The generalizations
for 3, 4, ..., or n tosses should be apparent. For the more
general situation in which the probability of the occurrence of an
event E is constantly p under repeated trials,


$(q + p)^2 = 1(p)^0(q)^2 + 2(p)^1(q)^1 + 1(p)^2(q)^0$
    = P(0 E's, 2 not-E's) + P(1 E, 1 not-E) + P(2 E's, 0 not-E's);


and again it should be apparent that these successive terms correspond
to formula 3.32 for r = 0, 1, and 2, successively.
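The correspondence between the terms of the binomial expansion and the probabilities of the separate classes of events can be seen numerically. The Python sketch below is an added illustration; `expansion_terms` is a hypothetical helper that lists the terms of $(q + p)^n$ in order of increasing r.

```python
from fractions import Fraction
from math import comb

def expansion_terms(p, n):
    """Terms C(n, r) * p**r * q**(n - r) of (q + p)**n for r = 0, 1, ..., n."""
    q = 1 - p
    return [comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]

# The expansion of (1/2 + 1/2)^2 written out in the text.
terms = expansion_terms(Fraction(1, 2), 2)
print(terms)

# Since q + p = 1, the terms of any such expansion must add to 1.
print(sum(expansion_terms(Fraction(2, 3), 4)))
```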




PROBLEMS 


1. What is the probability that if 6 unbiased pennies are tossed simultaneously,
exactly 3 heads will appear?

2. What is the probability that at least 3 heads will appear under the conditions
of problem 1? Ans. 21/32.

3. If one parent is Rhrh and AO, and the other parent is rhrh and BO, what
is the probability that both their first two children will be Rh- and AB?

4. Suppose that a sample of 100 bolts is taken from a very large batch which
contains exactly one-half of 1 per cent of unacceptable bolts. What is the
probability that at least 2 bolts in the sample will be unacceptable? Ans. .09.

5. If 5 bolts among the 100 in the sample of problem 4 are found to be
unacceptable products, what would you conclude about the hypothesis that only
one-half of 1 per cent were faulty in the whole batch? Give reasons.

6. Write out the series for $(x+y)^4$ and show that the coefficients are numbers
of combinations, $C_{4,r}$ with r = 0 to 4.

7. Suppose that the teams listed on a football parlay card are so handicapped
that you actually have a 50-50 chance on each team you pick. What is the
probability that you will pick exactly 9 winners out of 10? Would this probability
justify odds of 25 to 1 for this accomplishment? What about odds of
250 to 1 for getting 10 out of 10 correct?
8. If a pair of unbiased dice is to be thrown 6 times in succession, what is the
probability that exactly 3 sevens will be thrown? What would you think the
most likely number of sevens would be? Ans. .054.

9. If a certain manufacturing process is producing machine parts of which 10 
per cent have some serious defect, what is the probability that all of the 10 
parts chosen at random will be acceptable (that is, have no serious defect)? 
How many would you have to take in the sample before the probability of all 
being acceptable will be no greater than .05? 


10. Graph $f(p) = (1-p)^{10}$, and relate this graph to problems like problem 9.


3.4 MATHEMATICAL EXPECTATION


The discussions earlier in this chapter have involved the occurrences
of chance events as a result of what have been termed
"trials" under specified conditions. The outcome of a trial is described
in one of two general ways: (a) Something happens a certain
number of times on a specified number of trials, or (b) we simply
note whether or not an event E has, or has not, occurred and associate
with that occurrence some value, say a financial loss, as in insurance.
With either type of situation it may be important to be
able to predict what will be the average outcome of trials under the
stated conditions, over the long run of experience. For example, an
insurance company needs to know what amounts it should expect
to have to pay out in death benefits during a particular period of
time, one year, for instance.

In case a, the prediction needed is to be presented in the form of an
expected number of occurrences of an event E on a set of n future
trials. A formula for this expected number can be justified heuristically
as follows. If the probability of E is p, then p is just the fraction
of the time that E should occur over many trials. Hence, if there are
to be n trials, it is reasonable to say that the expected number of
occurrences of E on n trials is


(3.41)    Expected number = $E(r) = p \cdot n$.


Problem 3.41. If 6 unbiased coins are to be tossed simultaneously, what is the
expected number of heads?


In this circumstance p = 1/2 and n = 6; hence the expected (or
long-run average) number of heads is $E(r) = (1/2)(6) = 3$. Actually,
our intuition would lead to the same conclusion.


Problem 3.42. Suppose that an insurance company has insured 50,000 persons
who are each 30 years old, and that records from past experience show that
6/1000 of such persons die before reaching the age of 31. What is the expected
number of deaths during the first year of the insurance contract?


For this situation p = .006, n = 50,000; therefore the mathematically
expected number of death benefits among those thirty-year-olds
is E(r) = .006(50,000) = 300. The reader will realize that it
would be unsound financially for the company to be prepared to pay
only 300 death benefits because this time the number of deaths might
be considerably higher. All that is being said is that over a period
of years of such calculations the average number of deaths among
thirty-year-olds in this same insurance class will be very close to 300.
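Formula 3.41 was justified heuristically, but it also agrees with the long-run average obtained by weighting each possible number of occurrences r by its probability from formula 3.32. The Python sketch below, added for illustration, carries out that weighted sum exactly; the function and variable names are assumptions made for the example.

```python
from fractions import Fraction
from math import comb

def expected_count(n, p):
    """Sum of r * P(r) over r = 0, ..., n, with P(r) from formula 3.32."""
    p = Fraction(p)
    return sum(r * comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1))

# Problem 3.41: six unbiased coins.
coins = expected_count(6, Fraction(1, 2))

# Problem 3.42 by formula 3.41 directly: p times n.
insured = Fraction(6, 1000) * 50_000

print(coins, insured)
```

For the coins the weighted sum and the product p times n coincide, as they do for any n and p.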

When chance occurrences are of the type b described above there 
may be associated with the occurrence of E some value, say a finan- 
cial gain or loss. Then we may wish to predict the loss or gain to 
be expected on the average under the given conditions. For example, 
suppose that you are going to roll a pair of unbiased dice and are 
to be paid 60 cents if you get a sum of 7. How much should you 
pay to play such a game if you just wish to break even? Obviously 
you will receive either 60 cents or zero cents after each game; but 
over many games what will be your average winnings per game? 
That is the amount you can pay and break even. Because the prob- 
ability of throwing a sum of 7 is 1/6 you expect, mathematically, to 
win 60 cents on about one-sixth of your throws and to win zero cents 
on the other five-sixths of the throws. Hence, the mathematical
expectation logically is

(1/6)(60 cents) + (5/6)(0 cents) = 10 cents.

Therefore you can expect to break even in the long run if you pay
10 cents to play each game.

The game just described can be extended to include a reward of
90 cents if you throw a sum of 11 on the two dice. In this circumstance
you can win in either of two mutually exclusive ways, that is,
you can throw a 7 or an 11. Therefore attention is centered on three
classes of events and the corresponding rewards:

A sum of 7 with a reward of 60 cents,
a sum of 11 with a reward of 90 cents, and
a sum other than 7 and 11 with a reward of 0 cents.

Therefore, over a large number of games you will tend to win 60
cents on one-sixth of the throws, 90 cents on one-eighteenth of the
throws, and 0 cents on the other seven-ninths of the throws. Hence
your mathematical expectation on this game is (1/6)(60 cents) +
(1/18)(90 cents) + (7/9)(0 cents) = 15 cents, because that is the
average winnings per game and is, therefore, the amount you could
pay to play this game and expect to break even.

The preceding ideas and methods can be generalized and symbolized
in the following manner. Let all the single events possible
under a specified set of conditions be grouped into s mutually
exclusive classes of events. Let $x_i$ be the reward, loss, or in general
the "value" of the occurrence of an event in the ith class; and let
$p_i$ be the probability that an event in the ith class will occur on any
designated future trial under the stipulated conditions. Finally, let
E(x) stand for the total mathematical expectation under the given
conditions. It follows from the reasoning outlined above that


(3.42)    $E(x) = p_1x_1 + p_2x_2 + \cdots + p_sx_s = \sum_{i=1}^{s} (p_i x_i)$.
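Formula 3.42 can be applied to the dice game with rewards for a 7 and an 11 by first counting the single events in each class. The following Python sketch is an illustration added to the text, not a prescribed method; the names `expectation`, `throws`, and `game` are assumptions, and rewards are expressed in cents.

```python
from fractions import Fraction
from itertools import product

def expectation(classes):
    """Formula 3.42: sum of p_i * x_i over mutually exclusive classes."""
    return sum(p * x for p, x in classes)

# The 36 equally likely single events for two unbiased dice, as sums.
throws = [a + b for a, b in product(range(1, 7), repeat=2)]

p_seven = Fraction(throws.count(7), len(throws))
p_eleven = Fraction(throws.count(11), len(throws))
p_other = 1 - p_seven - p_eleven

# Each class of events paired with its reward in cents.
game = [(p_seven, 60), (p_eleven, 90), (p_other, 0)]
fair_price_cents = expectation(game)

print(p_seven, p_eleven, p_other)
print(fair_price_cents)
```

Listing each class as a (probability, value) pair keeps the computation in the same form as the formula, so further rewards can be added as extra pairs.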


PROBLEMS 


1. Suppose that you are 20 years of age and that you are to inherit $10,000
at the age of 30 if you are alive then. What is the expected value of this
inheritance if you have a probability of .92 of living to be 30 years of age? (This
probability is derived from the American Experience Mortality Table.)

2. It is approximately true that brown and blue eye colors are inherited in 
a manner similar to that explained for the A-B blood groups. If b/b = blue 
eye color and either B/b or B/B = brown eye color, what is the expected num- 
ber of blue-eyed children among 500 from parents who are B/b and b/b, re- 
spectively? Ans. 250. 

3. Answer the same question as in problem 2 for parents who are both B/b. 

4. If in each three-month period 1 car in 20 of the type which you drive has 
an accident costing an average of $75 for repairs, how much insurance against 
such a loss should you pay each quarter if you allow the company 15 per cent 
beyond mathematical expectation for handling the business, and if you ignore 
interest on your money? Ans. $4.31. 

5. How much would one be justified mathematically in wagering against one 
dollar that on 10 throws of two unbiased dice a sum of 7 will appear less than 
3 times? 

6. Suppose that you have the choice of receiving $10,000 at age 65 if you are 
alive then, or of taking a cash payment now. From a purely mathematical 
point of view and ignoring interest on money, what should the size of the pay- 
ment be if your probability of living to be 65 is .56? Ans. $5600. 

7. Suppose that a concession at a fair offers a 50-cent prize if you pay 10 
cents for 3 throws and knock down all of a stack of milk bottles on the 3 
throws. Suppose also that you have 1 chance in 10 to knock down the bottles. 
If the operator of the concession has to pay $75 per day for the privilege of 
doing business there, how many customers must he have per day in order that 
he can expect (mathematically) to make some money? 

8. Suppose that a person who is 40 years of age is to receive $1000 on each of
his sixtieth, sixty-first, and sixty-second birthdays if he is alive to receive them.
Also suppose that interest on money is to be ignored. Given that his
probabilities of living to those successive ages are .74, .72, and .70, what should this
person pay for an annuity of this sort if the company is allowed 15 per cent for
overhead? Ans. $2484.


REVIEW PROBLEMS 


1. What is the difference between a population of numerical measurements
and a sample of such measurements?

2. What was Political Arithmetic? With what sorts of problems were the
political arithmeticians mainly concerned?

3. Who was Student and how was his work connected with present types of
statistical problems?

4. Expand $(1/3 + 2/3)^4$ into a series and state specifically what probability
is given by each term if p = 2/3 is the probability that a certain loaded penny 
will turn up heads on any particular future throw. Describe the mathematical 
procedure needed here to define the single events. 

5. What is a frequency distribution? A relative cumulative frequency dis- 
tribution? 

6. Suppose that a college freshman has earned the following percentile ratings
on the indicated tests: (a) general intelligence, 90; (b) achievement in social
sciences, 65; (c) achievement in physical sciences, 92; (d) achievement in
mathematics, 95. What can you say about the student's probable future success
in courses in chemistry, physics, mathematics, history, and sociology if it is
assumed that the tests are trustworthy and if no serious personal problems
interfere?
7. Given that for a set of numerical measurements $X_1, X_2, \ldots, X_{59}$, $\Sigma X = 95$
and $\Sigma(x^2) = 2.06$, calculate the coefficient of variation.

8. Calculate the geometric, arithmetic, and harmonic means of 1/2, 2, and 8
and discuss the choice of the best average for these numbers. Ans. 2, 3.5, 8/7.
9. Suppose that the following probabilities regarding football games have been
determined reliably: A to beat B, $p_1 = 2/3$; C to beat D, $p_2 = 1/2$; and E to
beat F, $p_3 = 5/6$. What are the odds that A, D, and E all win?


10. Given that the graph below is the r.c.f. curve for a certain group of scores,
determine from it the median, ... Also interpret these results statistically, with
some indication of the uses to which such information can be put.
Ans. 31.5, 41.0, 35.0.


[Figure: relative cumulative frequency plotted against score.]
11. What proportion of the scores summarized in problem 10 lay between 40
and 60 inclusive? What proportion exceeded 60? 

12. What is the probability that on 15 flips of an unbiased penny one will get
either 7 or 8 heads? Will get neither 7 nor 8 heads? Ans. .39, .61.

13. Suppose that for a certain strain of chickens the probability that a late- 
feathering chick will be hatched from any egg selected at random is 1/16. What 
is the expected number of such late-feathering chicks among 800 newly hatched 
chicks? 

14. If a pair of true dice is rolled 60 times, what is the mathematically
expected number of sevens? Of either sevens or elevens? Of sums greater than 9?
Ans. 10, 13 1/3, 10.

15. Assume that the semester grades in a large chemistry class have the 
ogive graphed below. If the letter grades are to be distributed as follows: 
7 per cent A, 20 per cent B, 46 per cent C, 20 per cent D, and 7 per cent F, 
what are the grade ranges covered by each letter grade?

[Figure: ogive for problem 15. Vertical axis: relative cumulative frequency,
0 to 1.00. Horizontal axis: grade, 30 to 100.]


16. What is the median numerical grade for the data of problem 15 above? 
What are the upper limits of the quartiles? Ans. 69; 59.5, 69, 77, 100. 
17. Suppose that 6 unbiased pennies are to be tossed simultaneously. What
is the probability that no more than 2 will show heads? That at least 2 will
turn up heads?


18. Assume that the true odds on each of 3 horses to win a particular race
are determined to be as follows: horse A, 3:2; horse B, 1:3; and horse C, 1:9.
What is each horse's probability of winning? What is the probability that
some one of these 3 horses will win?

Ans. .60, .25, .10; .95.

19. Given that for three separate statistical populations of data: μ₁ = 25,
σ₁ = 4; μ₂ = 50, σ₂ = 5; and μ₃ = 100, σ₃ = 13. Which group of data would
you consider as relatively the more variable? Give specific statistical
evidence to back your answer.

20. Compute the mean deviation and the standard deviation for the following
data: 13, 9, 10, 17, 15, 20, 11, 5, 2, 10, 14, 13, 15, 21, 14, 6, 8, 14, 6,
3, 29, 16, 17, 15, 15, 18, and 2. Which measure of variation do you think best
describes the dispersion of these data about their arithmetic mean? Give
reasons. You are given that ΣX = 338, ΣX² = 5406.

Ans. 4.92, 6.24.
21. According to Figure 2.41, in which quartile would you place a score of 85?
In which decile would this score fall?

22. If all students whose ACE scores (Table 2.01) fell among the lower 15
per cent of all scores made were to be advised to consider seriously dropping
out of college, what would be the highest score whose recipient would receive
such advice? Ans. 67.

23. From Figure 2.41 determine approximately the percentage of those scores
which were not more than one times the standard deviation either greater than
or less than the mean.

24. In the game of “craps” two unbiased dice are thrown successively by the
same person. He wins if: (a) he throws a sum of either 7 or 11 on his first
throw; or (b) he throws a 4, 5, 6, 8, 9, or 10 on the first throw and repeats his
number on a subsequent throw before he throws a 7. What is his probability
of winning within 2 throws? Ans. 97/324.

25. In a certain gambling game you are paid 15 for 1 if you throw a 1 and a 2
(either order) on two unbiased dice. On 1800 games on each of which the
player pays one dollar, what is the expected percentage profit for the house
relative to the amount taken in?

... per cent, approximately.

CHAPTER 4


The Binomial and Normal 
Frequency Distributions 


The discussions and illustrations of Chapter 2 involved situations 
in which groups of measurements (usually numerous) had been 
taken under specified conditions, and we had in mind only an efficient 
summarization of the data. The ACE scores in Table 2.01 were
cited as an example. In a sense, we simply took what we got and
thereafter applied statistical methods to reduce a bulk of data to a 
more comprehensible form without losing any essential information. 
More generally, however, populations of numerical measurements 
must be studied by means of samples because so many measurements 
are involved that it is not feasible, efficient, or even possible to obtain 
and to analyze the whole of the population. 

Two different types of populations will be considered. In one type,
the chance variable will be a qualitative one such as male or female, 
dead or alive, own an automobile or do not own an automobile. The 
population will consist of individual members, each falling into one 
of just two classes according to the qualitative designation adopted. 
The other type of population to be considered will be based upon 
a variable which is measured along a continuous scale, such as the
weight of an individual, the volume of a gas, or the bushel yield of 
a variety of wheat. 

As regards populations in which a qualitative variable is used,
attention herein will be confined to what is called a binomial
population because each member of the population falls into one of only
two classes. The proportion of a binomial population which belongs to
one of the two classes will be measured by the fraction p, leaving the
fraction falling into the other class to be 1 - p = q. For example,
if all the babies born in New York City during a given year were
to be classified as male or female, p might be the fraction who were
males. If p = .51, then q = 1 - .51 = .49. The sex would be the
qualitative variable mentioned above, and has but two "values":
male and female. If a baby were to be chosen at random from
among those born in the specified year, its classification as male or
female would be a member of this binomial population. Under the
above assumptions, the probability that such a selection will turn
out to be male is p = .51.

If n repeated observations, or trials, are made on a binomial
population in which the proportion p stays fixed, and if attention is
fixed upon the number of individuals in each of the two classes, these
numbers are variable from one set of n trials to another. For example,
it was noted in Chapter 3 that the probability that r males, say,
and n - r females would be observed is given by $C_{n,r}\,p^{r}(1-p)^{n-r}$.
In other words, r is a chance variable. The relative frequencies with
which r will have the values 0, 1, 2, ..., and n after a great many sets
of n random trials from a binomial population constitute a binomial
frequency distribution. This distribution will be of more direct
interest to us than the binomial population in itself because the binomial
frequency distribution describes results which are obtained in the
process of sampling a binomial population.
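
As a minimal sketch of the expression just quoted from Chapter 3, the short Python program below lists the relative frequencies of r = 0, 1, ..., n for one choice of n and p. The function name `binomial_pmf` and the choice of n = 4 births are assumptions made only for illustration; p = .51 is the value used in the example above.

```python
from math import comb

def binomial_pmf(n, p):
    """Return [P(r = 0), ..., P(r = n)] for n trials with fixed proportion p."""
    q = 1 - p
    return [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

# Hypothetical illustration: 4 births chosen at random, p = .51 for "male".
probs = binomial_pmf(4, 0.51)
for r, pr in enumerate(probs):
    print(f"{r} males, {4 - r} females: {pr:.4f}")
```

The printed values are the ordinates of the binomial frequency distribution for these particular n and p; their sum is 1, since the n + 1 outcomes are mutually exclusive and exhaust the possibilities.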

There are many types of populations for which the random variable
is of the second type discussed at the beginning of this chapter,
namely, a measurement referred to a continuous scale, such as weight.
Probably, the most important populations of this sort are those called
normal populations. It will be convenient to describe this type of
population by means of a mathematical formula for its frequency
distribution. This will be done in a later section.
It seems obvious that we cannot possibly learn much by sampling
a population which cannot be clearly and concisely described; hence
there is need for a mathematical description, or classification, of
populations. We choose to study types of populations by means of
their frequency distributions because that, or something equivalent,
constitutes the fullest description we can obtain for a particular
population. As noted above, the discussion in this chapter will be
devoted to two of the most important types of frequency distributions:
one, the binomial, is appropriate to qualitative measurements
of a certain kind; the other, the normal, typifies continuous numerical
measurements of types quite frequently met in practice. Between
these two theoretical distributions, a great many of the uses of
statistical analysis will be introduced.
4.1 THE BINOMIAL FREQUENCY DISTRIBUTION


As was stated above, a binomial population and the corresponding 
binomial frequency distribution are involved when every single event 
which can occur under prescribed conditions must belong to one of
two classifications. This fact corresponds to the meaning of the 
prefix bi- in the word binomial. For example, if you take out a term 
insurance policy for a period of 10 years you and the company are 
interested in your subsequent classification as “dead” or “alive” 
before, or at, the end of the 10 years. Of course, the company in- 
sures many persons and regards them as a group, some of whom 
will be classifiable as “dead” and the remainder as “alive” at the 
expiration of the 10-year term. What the insurance company and 
its clients need to know, then, is this: Given a group of n persons 
insured for a 10-year term, what are the probabilities associated with 
each of the possible numbers of “dead” and “alive” insured persons 
during the 10-year period of the insurance contract? For any spe- 
cific n the relative frequency—over a great deal of experience—of the 
occurrence of 0, 1, 2, 3, ..., (n - 1), or all n “dead” after 10 years
will be the binomial frequency distribution mentioned above. It is
upon the basis of this sort of information that insurance premiums
are calculated.

Suppose, for simplicity, that a company has insured 10 persons 
who are 30 years of age for a 10-year term. What can the company 
expect to pay out in death benefits? It is obvious that at the end 
of the 10-year period any one of 11 events may have occurred. There 
can be 0, 1, 2, ..., or all 10 classified as “dead.” Also, over the 
experience of many such groups of contracts for 10-year periods those 
11 possible outcomes will occur with unequal relative frequencies 
which depend both on the number, such as n = 10, and on the prob- 
ability of death for persons in this age interval. Clearly, this bi- 
nomial frequency distribution depends on n and on p = probability 
of death between the ages of 30 and 40 years.

No one can state theoretically what the probability of death is for 
any particular person during the age period of 30 to 40 years; but 
tables have been compiled from experience which give the best available
estimate of the desired probability. For example, the American
Experience Table of Mortality indicates that the average probability
is approximately one-tenth that an insurable person (determined by 
examination before the company will insure) now 30 years of age 
will die before he is 40 years of age. 
If 10 persons are insured under conditions to which the American
Experience Mortality Table applies, it follows from the discussion
of section 3.3 that the 11 corresponding probabilities of occurrence
of these numbers of deaths are given by the successive terms of the
following binomial series:

$$(.9 + .1)^{10} = (.9)^{10} + 10(.9)^{9}(.1) + 45(.9)^{8}(.1)^{2} + \cdots + (.1)^{10}$$


It is not necessary to devise some game with p = .1 and discover from
experience that a fraction $(.9)^{10}$ of the trials will show no occurrences
of the event E because the only interest is in the relative number of
occurrences, and that is what the probability gives. Hence, the above
series gives the frequency distribution of the eleven possible classes
of events.
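
The eleven terms of the series for (.9 + .1)^10 can be evaluated directly. The following is a small sketch, not part of the original text, which assumes only the values n = 10 and p = .1 of the insurance example; the variable names are chosen for illustration.

```python
from math import comb

n, p = 10, 0.1          # 10 insured persons, probability of death .1
q = 1 - p

# Successive terms of (q + p)**n: probability of r deaths, r = 0, 1, ..., 10
terms = [comb(n, r) * q**(n - r) * p**r for r in range(n + 1)]

for r, t in enumerate(terms):
    print(f"{r:2d} deaths: {t:.6f}")
```

The first term printed is the fraction (.9)^10 discussed above, and the eleven terms together account for every possible outcome.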

To re-illustrate the discussion of the preceding paragraphs with an
example which the reader can reproduce easily and, in addition, to
show how to graph a binomial frequency distribution, attention again
is called to a mathematical model. Suppose that an unbiased coin
is to be flipped 10 times and the number of heads is to be recorded
after each set of 10 throws. In these circumstances, n = 10, p = 1/2,
and q = 1 - p = 1/2; hence the successive terms of the following
binomial series give the probabilities for 0, 1, 2, 3, ..., or 10 heads
on any future set of 10 throws:
$(1/2)^{10} + 10(1/2)^{9}(1/2) + 45(1/2)^{8}(1/2)^{2} + \cdots + (1/2)^{10}$;
or 1/1024 + 10/1024 + 45/1024 + ... + 1/1024. In view of the fact that
each of the denominators is 1024, we obtain a useful and simpler
expression for the relative frequency of occurrence of 0, 1, 2, ..., or
10 heads on 10 throws, by using only the numerators. From them a graph
can be constructed to depict the relative frequency for each
possibility, as is done in Figure 4.11.

Figure 4.11. Graph of the binomial frequency distribution for p = 1/2 and
n = 10.
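
The numerators used for such a graph are simply the binomial coefficients for n = 10. As a rough sketch (the text-bar display and its scale of one mark per 6 units are arbitrary choices made here, not anything prescribed by the book), they can be generated and drawn as follows.

```python
from math import comb

n = 10
numerators = [comb(n, r) for r in range(n + 1)]   # coefficients of (1/2 + 1/2)**10
total = sum(numerators)                           # common denominator, 2**10

for r, f in enumerate(numerators):
    bar = "#" * round(f / 6)    # crude text stand-in for the ordinates of the graph
    print(f"{r:2d} {f:4d}/{total} {bar}")
```

The bars rise to a single peak at r = 5 and fall away symmetrically, which is the shape described for the case p = 1/2.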
This figure also can be described as the graph of the binomial
frequency distribution when n = 10 and p = 1/2.

It is apparent that the actual form of a binomial frequency
distribution depends upon two numbers, n and p. If p = 1/2 = 1 - p,
the graph is symmetrical, as in Figure 4.11. If p > 1/2, the event E
is more likely to occur than to fail to occur; hence the higher ordinates
of the graph will be toward the right-hand side of the graph. If
p < 1/2, the reverse situation is expected. These remarks are
illustrated in Figures 4.12A and B. For Figure 4.12A, p = 2/9; and for
Figure 4.12B, p = 13/16. In both cases n = 10. The series for the
binomials $(7/9 + 2/9)^{10}$ and $(3/16 + 13/16)^{10}$ were employed in the
constructions of these figures, using only the numerators of the terms
as explained above.

Figure 4.12A. Graph of the binomial frequency distribution with p = 2/9 and
n = 10. (Horizontal axis: number of occurrences, r.)

Figure 4.12B. Graph of the binomial frequency distribution with p = 13/16 and
n = 10. (Horizontal axis: number of occurrences, r.)
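
The remark about where the higher ordinates fall can be checked numerically. The sketch below computes only the numerators of the terms, as the text describes, for p = 2/9 and p = 13/16 with n = 10; the helper name `numerators` and its arguments a and b (with p = a/b) are assumptions made for this illustration.

```python
from math import comb

def numerators(n, a, b):
    """Numerators of the terms of ((b - a)/b + a/b)**n, that is, p = a/b, for r = 0..n."""
    return [comb(n, r) * a**r * (b - a)**(n - r) for r in range(n + 1)]

for a, b in ((2, 9), (13, 16)):
    f = numerators(10, a, b)
    mode = f.index(max(f))          # r with the highest ordinate
    print(f"p = {a}/{b}: highest ordinate at r = {mode}")
```

For p = 2/9 the highest ordinate lies toward the left of the scale of r, and for p = 13/16 toward the right, in agreement with the discussion above.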

The r.c.f. distribution for a binomial situation is discontinuous,
as is expected, and involves successive ordinates, each at least as
large as the preceding one to its left on the graph. Such a graph is 
shown in Figure 4.13. If we were to draw a smooth curve through 
the tops of the ordinates, it would have the same general appearance 
as the r.c.f. curves drawn in Chapter 2. 

Fundamentally the frequency and r.c.f. tables corresponding to
Figures 4.11 and 4.13 are as shown in Table 4.11.


TABLE 4.11

FREQUENCY AND r.c.f. DISTRIBUTIONS FOR THE BINOMIAL DISTRIBUTION
DEFINED BY p = q = 1/2, n = 10. TOTAL FREQUENCY TAKEN = 1024, THE
SUM OF THE NUMERATORS OF THE SERIES FOR (1/2 + 1/2)^10

Number of occurrences of E
    r        f      c.f.    r.c.f.

   10        1     1024     1.000
    9       10     1023      .999
    8       45     1013      .989
    7      120      968      .945
    6      210      848      .828
    5      252      638      .623
    4      210      386      .377
    3      120      176      .172
    2       45       56      .055
    1       10       11      .011
    0        1        1      .001

        Σ(f) = 1024


The meaning and use of Table 4.11 are fundamentally the same as for
similar tables in Chapter 2, but some differences should be noted. The
major difference arises from the fact that the class “intervals” now are
just isolated points on a scale of measurement appropriate to r. For
example, 5 1/2 per cent (0.055) of the observed values of r (over a very
large number of observations on r) will be at or below r = 2. However,
these observed numbers of occurrences of E will be 2's, 1's, and
0's only: there is no such r as 1.6, for example. Another difference
between Table 4.11 and similar tables in Chapter 2 is that the former
is a theoretical table which fits any situation for which p = 1/2 and
n = 10. The frequency tables in Chapter 2 were relevant only to the
particular situation which produced the data summarized in a given
table.

We might wish to know what the median r is for a binomial dis- 
tribution. By Table 4.11, 37.7 per cent of the numbers are seen to 
be 0’s, 1's, 2’s, 3's, and 4’s. If 5’s are included, the percentage runs 
past 50 (needed for the median) to 62.3; therefore the median r 
must be 5. It is not some decimal fraction between 4 and 5 because
no such numbers even exist on the scale of measurement of r. 
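
The reasoning used to locate the median can be sketched in a few lines of Python. The variable names below are chosen only for illustration; the frequencies are the numerators for p = 1/2 and n = 10, listed from r = 0 upward.

```python
from math import comb
from itertools import accumulate

n = 10
f = [comb(n, r) for r in range(n + 1)]      # frequencies for r = 0, 1, ..., 10
cf = list(accumulate(f))                    # cumulative frequencies
rcf = [c / cf[-1] for c in cf]              # relative cumulative frequencies

# Median: the smallest r at which the r.c.f. reaches or passes .50
median_r = next(r for r, v in enumerate(rcf) if v >= 0.5)
print("median r =", median_r)
```

Because r takes only whole-number values, the search stops at an actual value of r rather than at an interpolated point between two values.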
Figure 4.13. The r.c.f. distribution for the binomial distribution with p = 1/2
and n = 10.


The median of the binomial distribution just considered also can 
be obtained from the r.c.f. distribution of Figure 4.13 by reading 
horizontally from the point where r.c.f. = .50 until we come to the 
first ordinate on the left, which is high enough to be intersected by 
the horizontal line from r.c.f. = .50.

It is interesting to compare a frequency distribution which was 
obtained by actual trials with that which would be expected mathe- 
matically under the specified conditions. This is done approximately 
in Table 4.12 for a situation in which 5 pennies were flipped 2000 
times. It was assumed that the pennies were unbiased, although 
this is known not to be strictly true for any actual coin. It should 
be apparent from previous discussions that the mathematically ex- 
pected proportions of the 6 possible combinations of heads and tails 
listed in column 1 of Table 4.12 are 1:5:10:10:5:1. The resulting 
expected numbers of occurrences of each of the possibilities are given
to the nearest whole number under the heading “Exp.” in columns
3, 5, 7, and 9.
TABLE 4.12

COMPARISON OF OBSERVED AND EXPECTED FREQUENCIES OF HEADS (H) AND
TAILS (T) WHEN 5 PENNIES (ASSUMED UNBIASED) ARE FLIPPED 100, 500,
1000, AND 2000 TIMES

Combination     100 Throws    500 Throws    1000 Throws   2000 Throws
of H and T
r, (n - r)      Obs.  Exp.    Obs.  Exp.    Obs.  Exp.    Obs.  Exp.

5H, 0T            4     3     ...    16      33    31      63    63*
4H, 1T           10    16      71    78     135   156     293   312
3H, 2T           33    31     151   156     287   313     600   625
2H, 3T           31    31     163   156     342   313     641   625
1H, 4T           18    16      86    78     168   156     332   312
0H, 5T            3     3     ...    16      35    31      71    63*

* Actually each of these numbers is 62.5 but was rounded off this way to
keep the sum of the observed and expected frequencies equal.
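
The expected numbers in such a comparison follow directly from the proportions 1:5:10:10:5:1. The sketch below, with the hypothetical helper name `expected_counts`, computes the unrounded expected frequencies for each number of throws; rounding to whole numbers is left to the reader, since the text adjusts some roundings to preserve the column totals.

```python
from math import comb

# Proportions for 5H, 4H, ..., 0H when 5 unbiased pennies are flipped
ratios = [comb(5, r) for r in range(5, -1, -1)]      # 1, 5, 10, 10, 5, 1

def expected_counts(sets):
    """Unrounded expected frequencies among `sets` throws of the 5 pennies."""
    return [sets * c / sum(ratios) for c in ratios]

for sets in (100, 500, 1000, 2000):
    print(sets, expected_counts(sets))
```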

After the 2000 trials involving 10,000 tosses the ratio of heads to
tails is 0.94 to 1. Hence there apparently is a weak but definite
tendency for tails to appear more frequently than heads; that is, p
is not exactly equal to 1/2. Methods will be described in Chapter 5
for deciding when a coin, say, is biased, and for estimating the
degree of bias.

If the observed frequencies in any column of Table 4.12 are taken
as the f and the r is listed merely as 5, 4, 3, 2, 1, and 0, we have an
observed frequency distribution, as in Chapter 2. If the expected
frequencies (which follow a mathematical law) are used as the f
column and r again is listed as 5, 4, 3, 2, 1, and 0, we have a
theoretical frequency distribution of the sort being discussed in this
chapter.

In view of the existence of a general mathematical expression for
the binomial frequency distribution (as in formula 3.32), we might
be curious to know if such statistical measures as the arithmetic
mean and the standard deviation can be determined just from the
n and p which determine the distribution. This is, in fact, true, as
will be shown partially below.

The discussion of mathematical expectation given in Chapter 3
included the information that the arithmetic mean of the number
of occurrences of an event E over many trials coincides with the
expected number, E(r), for any designated future trial. Hence it
already has been found from experience and intuition that for the
binomial situation μ = np. This result can be established for any
binomial frequency distribution, but such will not be done herein.


It also can be shown by somewhat more difficult mathematics that
the standard deviation for a binomial frequency distribution is given
by σ = √(npq); consequently, given n and p, we can compute the mean
and standard deviation very easily. This will be found to be helpful
later in this chapter.

Sometimes when dealing with binomial distributions it is
advantageous to work with the fractional number of occurrences, r/n,
rather than with the actual number, r. In this case, the arithmetic
mean is p, rather than np; and the standard deviation is √(pq/n),
instead of √(npq). To illustrate the use of these formulas both for r
and for r/n consider again the insurance example above in which n = 10
and p = .1. Under these circumstances the mean r is np = 10(.1) = 1, and
the standard deviation is √(npq) = √(10(.1)(.9)) = 0.949, approximately.
The mean r/n is p = .1, and the standard deviation of the
fraction dead is √(pq/n) = √((.1)(.9)/10) = 0.095, approximately. If
the number n were sufficiently large that it would be practical, such
information as that just derived might be useful to an insurance
company in anticipating the average number (or fraction) of death
benefits it could expect to pay, and in making sufficient allowance for
chance deviations from those average numbers so that adequate funds
would be available to pay death benefits.
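
The formulas μ = np and σ = √(npq) are stated in the text without proof. As a numerical check only, and not a derivation, the sketch below computes the mean and standard deviation of r directly from the eleven probabilities of the insurance example and sets them beside the values given by the formulas. The variable names are assumptions of this illustration.

```python
from math import comb, sqrt

n, p = 10, 0.1
q = 1 - p
pmf = [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

# Mean and variance of r computed term by term from the distribution itself
mean_direct = sum(r * pr for r, pr in enumerate(pmf))
var_direct = sum((r - mean_direct) ** 2 * pr for r, pr in enumerate(pmf))

print("mean of r:   ", mean_direct, "  formula np:", n * p)
print("s.d. of r:   ", sqrt(var_direct), "  formula sqrt(npq):", sqrt(n * p * q))
print("s.d. of r/n: ", sqrt(var_direct) / n, "  formula sqrt(pq/n):", sqrt(p * q / n))
```

The two columns agree to within the accuracy of the arithmetic, which illustrates, for this one case, the results quoted above.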


PROBLEMS


1. Use the coefficients of (1/2)^6 in the series for (1/2 + 1/2)^6 to graph the
binomial frequency distribution appropriate to sets of 6 trials with an event
whose probability of occurrence is constantly p = 1/2.

2. Under the conditions of problem 1, what is the probability that the event
will occur at least 4 times on 6 trials? Ans. 11/32.

3. Graph the frequency distributions for the binomial with p = 1/2 and
n = 4, 8, and 12, successively. Compute the μ and σ in each instance and locate
on the scale of r: μ ± 1σ, μ ± 2σ, and μ ± 3σ.

4. Graph the binomial frequency distribution for p = 1/4, n = 4, and read
from it the probability that r will be 2, 3, or 4.

5. Check the result obtained in problem 4 by constructing the r.c.f. graph and
reading the answer from this graph.

6. Graph the binomial frequency distribution for p = 1/7, n = 4, and determine
the probability that on one random set of 4 trials E will occur at least μ times,
where μ = arithmetic mean.

7. Flip 3 pennies 80 times, recording the number of heads after each toss of
the 3 pennies. Then compare the observed and the mathematically expected
numbers of occurrences in each of the four possible classes of events in terms
of number of heads.

8. Perform the operations of problem 7, except to compare the observed
and the theoretical values of the arithmetic mean.

9. Suppose that a large group of fruit flies consists of members who are
classified as either “normal” or “sooty.” Among 10 of these flies selected at
random, 3 were found to be “sooty.” How frequently would that result be
obtained if half the flies in the population are “sooty”? How frequently if 25
per cent are “sooty”?

10. Suppose that under the conditions of problem 9, 100 flies are chosen at
random and 30 (same percentage as in problem 9) are “sooty.” Answer the
same questions as in problem 9.
Ans. 23 times in 1,000,000; 12 times in 1 billion.

11. Construct the r.c.f. distributions for problem 3 and then determine the
median r in each distribution.

12. Suppose that there are two political parties interested in a certain college
election, and that 60 per cent of the eligible voters are Progressives and 40 per
cent are Independents. If a random sample of 10 persons is taken, what is the
probability that a plurality of them will be Independents, even though they
constitute the minority party? Ans. .17.

13. Referring to problem 12, how large must the sample be before the
probability is less than 1/4 that there will be more Independents than
Progressives in the sample?

14. Suppose that 6 persons out of 10 selected at random in a certain city
favor a particular flood-control policy. What is the probability of such a result
when only 45 per cent of those in that city actually favor that policy? Ans. .16.

15. For problem 9 determine the median number of “sooty” flies among 10.
What is the probability that the actual number observed will exceed the
median?

4.2 THE NORMAL FREQUENCY DISTRIBUTION


A frequency distribution for a population of numerical measurements
is intended to display in some manner the density with which
the measurements are distributed along the scale on which they are
measured. Such a frequency distribution indicates the region (along
the scale of measurement) in which the measurements tend to be
most numerous, and also shows the way in which they are dispersed
about that region of concentration. The reader should see that these
are the same two general matters of concern considered in Chapter
2. Averages were used to measure general level of performance (as
on ACE scores), and measures like the standard deviation, mean
deviation, range, and quartiles were employed in the description of
the dispersion of the data along the scale of measurement. This sort
of information is essential to any adequate description of a
population, and also is vital when considering sampling problems.

In order that we may be able to perform certain useful statistical 
analyses it usually is necessary to assume (after investigation) that 
the data conform to some general type of frequency distribution, 
such as the binomial frequency distribution considered in the preced- 
ing section. In that section a formula was used to determine the 
frequency distribution for a binomial population when the basic 
information (n and p) was available. The formulas and the proce- 
dures for their use are appropriate for discontinuous measurements 
which fall into only two categories, such as heads and tails. 

Likewise, we need a mathematical formula which is appropriate 
when continuous measurements (such as weights, heights, and ages) 
can be expected to conform to what is called a normal frequency
distribution. Mathematicians long ago derived the necessary formula;
in fact, it has been derived several different ways, all of which (as
rigorous derivations) are inappropriate to this book. However, it is
possible here, and useful, to show how the normal distribution is re- 
lated to the binomial frequency distribution. 

As the number of trials (n) is increased the number of ordinates 
which graphically represent the binomial frequency distribution also 
becomes greater. As the n increases the discontinuity of the dis- 
tribution may become less important and less noticeable for many 
practical purposes. This matter is illustrated in Figures 4.21A, B,
and C, for which p = 1/2 and n = 5, 20, and 100, respectively. In
Figure 4.21A our eyes have to search a bit for the actual form of the
distribution; for n = 20, the points rather definitely follow a certain 
symmetrical curve quite well; and for n = 100, the points of the graph 
dot out a symmetrical bell-shaped curve quite clearly. To put the 
matter another way, if the instructor were to ask each member of 
the class to draw a smooth curve which seemed to the student to fit 
the points of the figures best, there would be considerable hesitation 
and disagreement about Figure 4.21A, much less trouble with Figure 
4.21B, and practically unanimous accord concerning the curve needed 
for Figure 4.21C.

The student will realize that the labor involved in the construction
of Figures 4.21A, B, and C becomes increasingly great as n varies from
5 to 100. In view of the fact that Figures 4.21B and C are closely
approximated by continuous curves, we might hope that a relatively
simple formula for a continuous curve might be employed instead of
$C_{n,r}\,p^{r}q^{n-r}$, or instead of a summation involving this formula.
Fortunately, it can be shown mathematically that if n is fairly large and
p is not far from 1/2, the numbers obtained from $C_{n,r}\,p^{r}(1-p)^{n-r}$
by setting r successively equal to 0, 1, 2, ..., and n are much the same
as those obtained from
$\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(X-\mu)^{2}/2\sigma^{2}}$, wherein X replaces
r, μ = np, σ = √(npq), and e = the base for natural logarithms. In
particular, if p = 1/2 so that μ = n/2 and σ = √n/2, it is found
that the approximation is very close for n = 20 or more. Table 4.21
shows the approximation when n = 20.

Figure 4.21A. The binomial frequency distribution with p = 1/2 and n = 5.

Figure 4.21B. The binomial frequency distribution with p = 1/2 and n = 20.

Figure 4.21C. The binomial frequency distribution for p = 1/2 and n = 100.


TABLE 4.21

ILLUSTRATION OF THE GOODNESS WITH WHICH THE NORMAL FREQUENCY
CURVE FITS THE BINOMIAL FREQUENCY DISTRIBUTION WHEN p = q = 1/2
AND n = 20

r or X  Binomial  Normal  Error      r or X  Binomial  Normal  Error

   0      .000     .000    ...         11      .160     .161    .001
   1      .000     .000    ...         12      .120     .120    .000
   2      .000     .000    ...         13      .074     .073    .001
   3      .001     .001    .000        14      .037     .036    .001
   4      .005     .005    .000        15      .015     .015    .000
   5      .015     .015    .000        16      .005     .005    .000
   6      .037     .036    .001        17      .001     .001    .000
   7      .074     .073    .001        18      .000     .000    ...
   8      .120     .120    .000        19      .000     .000    ...
   9      .160     .161    .001        20      .000     .000    ...
  10      .176     .178    .002
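
A comparison of this kind can be regenerated with a short program. The sketch below assumes only the quantities named in the text (n = 20, p = 1/2, μ = np, σ = √(npq)); the function name `normal_ordinate` is introduced here for illustration. Small differences in the last decimal place from a printed table may arise from rounding.

```python
from math import comb, exp, pi, sqrt

n, p = 20, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

def normal_ordinate(x):
    """Ordinate of the normal curve with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

for r in range(n + 1):
    b = comb(n, r) * 0.5**n        # exact binomial relative frequency
    y = normal_ordinate(r)         # normal approximation at X = r
    print(f"{r:2d}  {b:.3f}  {y:.3f}  {abs(b - y):.3f}")
```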


If the relative frequencies calculated the two ways shown in Table 
4.21 are plotted on a common set of axes, Figure 4.22A is obtained. 
Graphically, the normal frequency distribution fits this binomial dis- 
tribution almost perfectly at the points where the binomial distribu- 
tion exists. 

The sum of all the relative frequencies (ordinates) for the binomial 
frequency distribution is 1 because it is the sum of the probabilities 
for all of the (n +1) mutually exclusive events which are possible 
under the specified conditions. Likewise the sum of all the ordinates 
of the normal curve at the points where X = 0, 1, 2,..., 19, and 20 
will add to approximately 1. If rectangles of width 1 and heights

$$y_i = \frac{1}{\sqrt{10\pi}}\, e^{-(X_i - 10)^{2}/10},$$

where i = 0 to 20, inclusive, are constructed as in Figure 4.22B, their
total area also is approximately 1. Moreover, the total area of the
rectangles is approximately the same as the area under the normal
curve, as the reader can verify visually.

Figure 4.22A. The normal curve fitted to the binomial frequency distribution
with n = 20 and p = 1/2. (Horizontal axis: number of heads, r or X; vertical
axis: relative frequency; • = binomial; x = normal approximation.)

Figure 4.22B. Illustration of the relationship between the area under the normal
curve and the probabilities which can be derived from a binomial distribution
with p = 1/2 and n = 20. (• = binomial; x = normal approximation.)

With the preceding remarks in mind, consider the following two
facts: (a) To obtain the exact probability that r will have one of
the values from a to b, inclusive, we need to sum the ordinates,
$C_{n,r}(1/2)^{n}$ for n = 20 and r = a, a + 1, a + 2, ..., b. (b) The
operation described in (a) is approximately the equivalent of finding the
area under the normal curve between the points X = a - 1/2 and
X = b + 1/2. The operation of (a) is very laborious; hence if (b) can
be accomplished with much less work and is satisfactorily accurate,
it should be the better method. As a matter of fact, this is the case,
as will be shown by some of the subsequent discussion of this chapter.
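
The two operations (a) and (b) can be compared numerically. In the sketch below the area under the normal curve is obtained from the error function in Python's standard library rather than from a printed table; that substitution, the function names, and the choice a = 8, b = 12 are assumptions made only for this illustration.

```python
from math import comb, erf, sqrt

n, p = 20, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

def normal_cdf(x):
    """Area under the normal curve (mean mu, s.d. sigma) to the left of x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def exact(a, b):
    """Operation (a): sum of the binomial ordinates for r = a, ..., b."""
    return sum(comb(n, r) for r in range(a, b + 1)) / 2**n

def approx(a, b):
    """Operation (b): normal area between X = a - 1/2 and X = b + 1/2."""
    return normal_cdf(b + 0.5) - normal_cdf(a - 0.5)

print(exact(8, 12), approx(8, 12))
```

The half-unit extension at each end corresponds to the rectangles of width 1 centered on the whole-number values of r.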

If the relative frequency of occurrence of a normally distributed
measurement, X, is denoted by y₁, we have the following general
formula for y₁:

$$y_1 = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(X-\mu)^{2}/2\sigma^{2}} \tag{4.21}$$

Hence, if the μ and σ are known and the measurement X is known
to have a normal distribution, we can graph the frequency
distribution by the usual methods of algebra. For example, if μ = 60 and
σ = 10, formula 4.21 becomes

$$y_1 = \frac{1}{\sqrt{2\pi}\,(10)}\, e^{-(X-60)^{2}/200} \tag{4.22}$$

Table 4.22 was prepared from this formula, and Figure 4.23 then
was constructed from the pairs of values (X, y₁) in that table. It
will be left as an exercise for the student to verify the values given
for y₁ by using Table VI (end of book) to obtain $e^{-w}$, where w =
(X - 60)²/200. Thereafter, division by 10 gives the numbers in
Table 4.22, under the heading y₁.
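
For a reader without a table of exponentials at hand, the ordinates of formula 4.22 can also be computed directly. The following sketch uses Python's `exp` function in place of a printed table (a substitution made here for convenience); the function name `y1` mirrors the symbol of the text.

```python
from math import exp, pi, sqrt

mu, sigma = 60, 10

def y1(x):
    """Ordinate of the normal curve of formula 4.22 (mu = 60, sigma = 10)."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

for x in range(30, 95, 5):
    print(x, f"{y1(x):.3f}")
```

The ordinates rise to their largest value at X = 60 and are the same at equal distances above and below that point.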

The following information ca 
423: (a) The normal distribu 


erify the values given 
) to obtain e", where w = 


n be obtained easily from Figure 


E ; tion curve is symmetrical about a 
vertical line through the point where X = y = 60; (b) the median 


xe the modal X, and the arithmetic mean of the X’s are equal and 
each 1s equal to 60; and (c) after (X — 60) becomes at least twice 
the size of the standard deviation, either positive or negative, the 
corresponding ordinates, y,, are very small. In faet, when (X — 60) 
becomes three times the size of the standard deviation, the corre- 
sponding y, is practically zero. Hence, it follows that for a truly 


—————— 


Sec. 42 THE NORMAL FREQUENCY DISTRIBUTION 91 


normal distribution the useful range of the X's is about six times the 
size of the standard deviation. The reader should recall that this 
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Figure 423. The normal frequency distribution curve for a population with 
4 = 60 and c = 10. 


approximate relationship between the range and the standard devia- 
lion was used in Chapter 2 when the normal frequency distribution 
was mentioned first. 

TABLE 4.22 

COORDINATES OF POINTS SATISFYING FORMULA 4.22 FOR A NORMAL 
FREQUENCY DISTRIBUTION WITH μ = 60 AND σ = 10 

   X      y₁         X      y₁ 
  30    .000        65    .035 
  35    .002        70    .024 
  40    .005        75    .013 
  45    .013        80    .005 
  50    .024        85    .002 
  55    .035        90    .000 
  60    .040 
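
The entries of Table 4.22 can be checked by evaluating formula 4.22 directly. The following Python sketch is one way to do so; the function name `normal_ordinate` and the dictionary `table_4_22` are illustrative names chosen here, not part of the text.

```python
import math

def normal_ordinate(x, mu, sigma):
    """Ordinate y1 of formula 4.21 for a measurement x (illustrative helper)."""
    w = (x - mu) ** 2 / (2 * sigma ** 2)
    return math.exp(-w) / (math.sqrt(2 * math.pi) * sigma)

# Formula 4.22 is the special case mu = 60, sigma = 10.
table_4_22 = {x: round(normal_ordinate(x, 60, 10), 3) for x in range(30, 95, 5)}

for x, y in table_4_22.items():
    print(f"{x:3d}  {y:.3f}")
```

Rounded to three decimals, the computed ordinates should agree with the table, including its symmetry about X = 60.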


Most students have asked (or heard someone else ask) an 
instructor: "Do you grade on the curve?" The curve which the 
student has in mind is the normal frequency distribution curve; but it 
appears from the discussion above that there is a different normal 
curve for each combination of μ and σ. This is correct; but the 
student who is asking such a question is chiefly interested in his 
performance relative to the other persons who took the same 
examination. He hopes that a grade of 40 on one test is as good as a grade 
of 70 on another test if its relative rank among all grades on that 
examination is the same in both instances. In other words, what is 
of interest is the general form of the frequency distribution of a set 
of grades and a system for comparing one person's grade with all 
the other relevant grades. The discussion which follows is intended 
to show how we can reduce all formulas for particular normal 
frequency distributions to one general, and simpler, formula which 
preserves all the information which we usually desire from such a 
formula. 

Multiply through formula 4.21 by σ and then make the following 
substitutions of variables: let y = σy₁ and let λ = (X − μ)/σ. The 
result of these substitutions is the following formula for the standard 
normal frequency distribution: 

(4.23)    $y = \dfrac{1}{\sqrt{2\pi}}\, e^{-\lambda^2/2}$ 

What has been done by means of these substitutions can be described 
graphically as follows: (a) Both the vertical and the horizontal axes 
have been marked off in multiples of the standard deviation, σ; and 
(b) the peak of the curve (which is above the point where X = μ = 
md = MO) has been placed above the point where λ = 0. Hence 
the X's which are less than μ now correspond to negative values of λ, 
and those which are greater than μ now correspond to positive λ's. 

The first two columns of Table III give the numbers needed to 
construct the graph of equation 4.23. Figure 4.24 was constructed 
by means of this table. Figures 4.23 and 4.24 are essentially the 
same curve; the only difference lies in the way the vertical and 
horizontal axes are scaled. In Figure 4.23 λ would be 0 under the 
point where X = 60, and would be +1 under the point where X = 70 
because 70 is one times the standard deviation larger than 60, the 
mean. The λ would be −1 under the point where X = 50 because 
50 is one times the standard deviation smaller than the mean, 60. 
The other corresponding values of λ and X can be determined in 
the same manner. 
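
The change of variables leading to formula 4.23 can be illustrated numerically. The sketch below, with helper names chosen here for illustration only, converts a few X's from the population with mean 60 and standard deviation 10 into standard normal units and confirms that the standard deviation times the ordinate of formula 4.21 equals the ordinate of formula 4.23.

```python
import math

def normal_ordinate(x, mu, sigma):
    """Formula 4.21: ordinate y1 for a measurement x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def standard_units(x, mu, sigma):
    """Standard normal unit, lambda = (X - mu) / sigma."""
    return (x - mu) / sigma

def standard_ordinate(lam):
    """Formula 4.23: ordinate y of the standard normal curve."""
    return math.exp(-(lam ** 2) / 2) / math.sqrt(2 * math.pi)

mu, sigma = 60, 10
for x in (40, 50, 60, 70, 80):
    lam = standard_units(x, mu, sigma)
    # sigma * y1 and y agree, which is the content of the substitution y = sigma * y1.
    print(x, lam, round(sigma * normal_ordinate(x, mu, sigma), 4),
          round(standard_ordinate(lam), 4))
```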

An illustration of the application of standard normal frequency 
distributions to a generally familiar situation can be obtained from 
the batting averages of baseball players. The conditions which might 
affect batting averages may change from season to season or from 
league to league so that such averages for the different situations are 
not directly comparable. For example, the ball may be livelier one 
season than during another; or perhaps the pitching may generally 
be poorer one season than another. Hence an average of .350 during 
a season with a lively ball and generally mediocre pitching might not 
represent a better batting performance than an average of .320 
attained with a less lively ball and more effective pitching. 

These matters should be reflected in the general level of batting 
averages and in the consistency with which players' averages are 
grouped about the general average. That is, the mean and the standard 
deviation of the batting averages should be taken into account. This 
is precisely what is done when standard normal units are employed. 


Figure 4.24. Graph of the standard normal frequency distribution whose 
formula is given by equation (4.23). [Vertical axis: y, from 0 to .40; 
horizontal axis: λ, from −3 to 3.] 


There also seems to be evidence indicating that batting averages 
can be assumed to be reasonably normal in their distribution. 

Batting averages for some of the better batters from the National 
and American leagues (undifferentiated here but kept separate when 
the standard normal units were computed) are presented in an 
ordered array in Table 4.23, first as they usually appear and then in 
terms of standard normal units. 

Some interesting conclusions can be drawn from Table 4.23, 
although they might be disputed upon the basis of other evidence and 
other points of view. For example: (a) The best batter listed for 
1940 (Debs Garms) ranks fifteenth in standard normal units, 
considering all 4 years together. (b) The batter with the best average 
of all (Ty Cobb), when the level and dispersion of batting averages 
within a league and year are taken into account, had an actual 
average of .385, which was bettered by five other batters unless 
standard normal units are employed. In these units, his average 
is a decided stand-out, being 0.26 unit ahead of the runner-up. 
(c) In general, there is reason to believe that the batters in the year 
1940 were not up to the standards of the other years shown in Table 
4.23, especially those of 1920 and 1930. 

TABLE 4.23 

ORDERED BATTING AVERAGES FOR INDICATED YEARS BEFORE AND AFTER 
CONVERSION TO STANDARD NORMAL UNITS 

(Includes higher ranking batters who were in at least 75 games during the 
season.) 

      Actual Average            Standard Normal Units 
 1910  1920  1930  1940        1910  1920  1930  1940 
 .385  .405  .401  .355        3.43  3.17  2.45  1.96 
 .364  .388  .393  .352        2.85  2.72  2.24  1.88 
 .340  .382  .386  .348        2.19  2.57  2.06  1.78 
 .331  .376  .383  .344        1.94  2.41  1.98  1.68 
 .325  .370  .381  .342        1.77  2.25  1.93  1.63 
 .322  .369  .379  .340        1.69  2.23  1.88  1.58 
 .321  .360  .379  .340        1.66  1.99  1.88  1.58 
 .320  .355  .374  .337        1.63  1.86  1.75  1.51 
 .312  .351  .373  .326        1.41  1.75  1.73  1.23 
 .309  .340  .368  .322        1.33  1.47  1.60  1.13 
 .308  .339  .367  .320        1.30  1.44  1.57  1.08 
 .306  .338  .366  .319        1.25  1.41  1.55  1.06 
 .305  .338  .366  .318        1.22  1.41  1.55  1.03 
 .304  .334  .359  .317        1.19  1.31  1.37  1.01 
 .302  .333  .359  .317        1.14  1.28  1.37  1.01 
 .301  .333  .357  .316        1.11  1.28  1.31  0.98 
 .300  .332  .356  .316        1.08  1.26  1.29  0.98 
 .300  .328  .355  .316        1.08  1.15  1.26  0.98 
 .298  .328  .354  .314        1.02  1.15  1.24  0.93 
 .298  .328  .350  .313        1.02  1.15  1.13  0.90 


PROBLEMS 

1. Graph the normal frequency distribution with μ = 4 and σ = 2 directly 
from equation 4.21. 

2. Graph the normal distribution of ... 

3. Graph ... n = 8 and p = 1/2. Do ... p = 1/4, and note the decrease in the ... 

4. Graph as in problem 3, with n = 12 instead of 8, and comment on the 
effect of increasing the size of n. 

5. Graph the frequency curve for a normal population with μ = 10 and 
σ = 2, and estimate roughly from the graph the proportion of all the 
measurements in this population which are greater than or equal to 12. 

6. Make a frequency distribution table for the birth weights of male guinea 
pigs as recorded in Table 2.61, compute the μ and σ, and then graph the normal 
curve with the same μ and σ. How does the graph compare with the frequency 
distribution curve made directly from your distribution table? 

7. Perform the operations of problem 6, using the records for the female 
guinea pigs in Table 2.61. 

8. Perform the operations of problem 6, using the 4-day gains of male guinea 
pigs as listed in Table 2.62. 

9. Perform the operations of problem 6, using the 4-day gains of female 
guinea pigs as given in Table 2.62. 

10. Graph the binomial frequency distribution for n = 16 and p = 1/2 and 
then plot the corresponding normal distribution on the same axes, adjusting 
the height to fit the binomial. Also construct for each value of r a rectangle 
of base r − 1/2 to r + 1/2 and height = $C_{16,r}\,p^r q^{16-r}$. Then indicate on your 
graph the area under the normal curve which is approximately equal to 
P(8 ≤ r ≤ 11), the probability that r will have a value from 8 to 11, inclusive. 

11. Perform the operations of problem 10, with p = 3/5. 

12. Choose any available source and compare the batting averages in the 
National League for 1940 and 1950, using the leading 25 players in each year 
and converting the batting averages to standard normal units. 


4.3 DETERMINATION OF THE PROPORTION OF A NORMAL 
POPULATION OF MEASUREMENTS INCLUDED 
BETWEEN ANY SPECIFIED LIMITS 


In Chapter 2 the student was given the opportunity to learn how 
to construct an r.c.f. distribution, how to graph it, and how to 
determine from this graph the limits on X which would include any 
specified proportion of the data so summarized. Furthermore, the 
inverse process also was discussed, namely, the determination of the 
proportion of the data which lies within specified limits. It is 
desirable to be able to obtain the same sort of information for 
normally distributed groups of measurements. The basis for such a 
procedure was given in the preceding section. 

There is, however, one major difference between the process taught 
in Chapter 2 and that which is necessary to handle the standard 
normal frequency distribution. In the latter situation there is no 
distribution table with class intervals and cumulative frequencies 
determined by means of certain arithmetic procedures. Instead the 
r.c.f. distribution must be derived from the formula for the normal 
distribution function. The mathematical procedures needed in this 
process are beyond the level of this course; but the reader can 
understand that the r.c.f. curve of Figure 4.31 plays the same general role 
in the analysis of normal data that the r.c.f. curves did in Chapter 2. 

Figure 4.31. Relative cumulative frequency distribution for the standard 
normal frequency distribution described by formula 4.23. 


The following problems will illustrate the uses to which Figure 
4.31 can be put. 


Problem 4.31. Determine the limits on λ for the third quartile of a standard 
normal population of measurements. 

The limits required are obviously the median and Q₃, respectively. 
If we read horizontally from .50 on the vertical scale over to the 
normal r.c.f. curve and then downward to the horizontal scale, we 
find that λ = 0, as is to be expected. Doing likewise for .75 on the 
vertical scale, we find that λ = 0.68; therefore, the limits on the third 
quartile are λ = 0 to λ = 0.68. Since these limits apply to any 
standard normal distribution, the limits of the third quartile for any 
particular normal distribution in terms of a measurement, X, can be 
obtained from the relation: λ = (X − μ)/σ. 


Problem 4.32. What is the probability that a measurement chosen at random 
from a normal population with μ = 50 and σ = 5 will be found to lie between 
50 and 52? Between 48 and 50? Between 60 and 65? 

To reduce this specific normal distribution to the standard normal 
distribution, substitute μ = 50 and σ = 5 into λ = (X − μ)/σ so 
that λ = (X − 50)/5. If X = 50, λ = 0; and if X = 52, λ = 0.40. 
To answer the specific question asked regarding probabilities it is 
necessary now to extend somewhat the concept of probability 
previously employed herein. 

When the possible events correspond to positions along a 
continuous scale of measurement, the number of possibilities (previously 
denoted by N) is infinite. Moreover, the likelihood of occurrence 
changes along the scale. It no longer is useful to ask for the 
probability that X will have a specific size along this scale on any future 
trial. Instead, an event E will consist of X lying within certain limits. 
The probability that a randomly chosen X will fall between the 
limits X = a to X = b now will be defined to be the proportion of 
all the X's in the population which are included in that interval. 
Graphically, this will be the proportion of the whole area under the 
frequency distribution curve which lies between X = a and X = b. 
Therefore, in problem 4.32, we need to know what proportion of this 
normal population lies between λ = 0 and λ = 0.40. From Figure 
4.31 it is learned that 50 per cent of this population has values less 
than 0 and about 66 per cent has values less than 0.40; therefore, 
about 16 per cent of the numbers in a normal population have λ's 
between 0 and 0.40. It follows that P(50 ≤ X ≤ 52) = .16. 

It is concluded from the symmetry of the normal curve that 
P(48 ≤ X ≤ 50) = .16 also. Furthermore, P(60 ≤ X ≤ 65) = .025 
because 2.5 per cent of the X's have sizes within the limits 60 to 65. 
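
The three probabilities of problem 4.32 can also be computed rather than read from a graph. The sketch below uses `statistics.NormalDist` from the Python standard library in place of Figure 4.31; the helper `prob_between` is a name chosen here for illustration. Values read from a graph are only as accurate as the reading, so small differences from the computed areas are to be expected (the computed area between 60 and 65 is nearer .021 than .025).

```python
from statistics import NormalDist

# Normal population of problem 4.32: mean 50, standard deviation 5.
population = NormalDist(mu=50, sigma=5)

def prob_between(dist, a, b):
    """Proportion of the population lying between a and b (area under the curve)."""
    return dist.cdf(b) - dist.cdf(a)

p_50_52 = prob_between(population, 50, 52)
p_48_50 = prob_between(population, 48, 50)
p_60_65 = prob_between(population, 60, 65)

print(round(p_50_52, 3), round(p_48_50, 3), round(p_60_65, 3))
```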

As a final illustration of the use to which Figure 4.31 can be put, 
consider a problem of grading "on the curve." 


Problem 4.33. Given that a large group of grades in psychology conform to 
a normal distribution with μ = 75 and σ = 7, suppose it is required to put letter 
grades on these scores in the proportions: 7A:20B:46C:20D:7F. What are the 
numerical limits on each letter grade? 


It is useful first to translate the proportionality above into a 
different form as follows. Starting with the lowest grade, F (which 
will be represented at the left-hand end of the scale of λ), we have the 
following facts: 

.07 of the grades are to be F; 
.27 of the grades are to be D or F; 
.73 of the grades are to be C, D, or F; and 
.93 of the grades are to be B, C, D, or F. 

It is learned from Figure 4.31 that, for a normal population, 

.07 of the X's correspond to λ ≤ −1.48; 
.27 of the X's correspond to λ ≤ −0.65; 
.73 of the X's correspond to λ ≤ +0.65 (by symmetry with the .27 point); and 
.93 of the X's correspond to λ ≤ +1.50. 

In terms of the X's, we have the following facts obtained from the 
relation λ = (X − μ)/σ: 

.07 of the X's are ≤ 65−; 
.27 of the X's are ≤ 70+; 
.73 of the X's are ≤ 80−; and 
.93 of the X's are ≤ 86−; 

therefore, the required numerical limits on the letter grades are as 
follows: 

A = 86 on; B = 80 to 85; C = 71 to 79; D = 65 to 70; 
F = below 65. 
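
The cut points for problem 4.33 can be computed from the inverse of the normal r.c.f. instead of being read from the graph. The sketch below is a hedged illustration: it uses the `inv_cdf` method of `statistics.NormalDist` from the Python standard library, and the dictionary names are chosen here for convenience. Computed cut points will differ a little from graph readings.

```python
from statistics import NormalDist

# Grades of problem 4.33: mean 75, standard deviation 7.
grades = NormalDist(mu=75, sigma=7)

# Cumulative proportions at the upper limit of each grade, lowest grade first.
cumulative = {"F": 0.07, "D": 0.27, "C": 0.73, "B": 0.93}

# Score below which the stated proportion of the grades falls.
upper_limits = {grade: grades.inv_cdf(p) for grade, p in cumulative.items()}

for grade, limit in upper_limits.items():
    print(f"{grade}: scores up to about {limit:.1f}")
```

Because the proportions are symmetrical (7, 20, 46, 20, 7), the computed limits are symmetrical about the mean of 75.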


The preceding applications of Figure 4.31 have given approximate 
answers to the questions asked, and these answers are as accurate 
as the graph used and our ability to read values from it will allow. 
It seems rather obvious that a more accurate and, if possible, more 
convenient method is desirable. A method of this sort is available 
through the use of statistical tables. They perform essentially the 
same service as Figure 4.31. Although their derivation is not 
appropriate to this book, the reader can simply keep in mind the fact that 
the information obtained from Table III is the same as that which 
can be derived directly from Figure 4.31, but is in a more accurate 
and convenient form. 

It will be left as an exercise to rework problems 4.31 to 4.33, 
inclusive, using Table III in place of Figure 4.31 as was done above. 

It is worth while to investigate a set of data from Chapter 2 to 
see if it seems to be following a normal frequency distribution, at 
least approximately. Actually it is not feasible at this level of 
statistics to decide this matter rigorously; but some useful information 
can be obtained nonetheless. 

Consider first the ACE scores of Table 2.01, their frequency 
distribution in Table 2.42, and the graph of Figure 2.41. Obviously, 
some approximation is introduced by using such a summary, 
especially one with only 12 class intervals, but the approximate 
distribution will serve the purpose here. The graphs of Figure 2.41 would 
resemble those of Figures 4.23 and 4.31 rather closely if the former 
were smoothed curves instead of broken-line graphs. Hence, it 
appears, superficially, that the population of ACE scores follows a 
normal distribution fairly well if the more general and important 
features are the only ones considered. To be more definite, consider 
the following information: 

(a) For the ACE scores, μ = 96, approximately, and the median 
is 97. These averages are equal in a normal distribution but the 
discrepancy is not at all large. 
(b) The following table shows the corresponding proportions 
within stated, and important, intervals on X: 

PERCENTAGE OF THE POPULATION INCLUDED 

Interval on X     ACE     Normal    Difference 
μ ± 0.5σ         37.6     38.3       −0.7 
μ ± 1.0σ         67.1     68.3       −1.2 
μ ± 1.5σ         85.3     86.6       −1.3 
μ ± 2.0σ         95.2     95.4       −0.2 
μ ± 2.5σ         99.2     98.8       +0.4 
μ ± 3.0σ         99.8     99.7       +0.1 


Although the deviations from normal expectancy are somewhat 
systematic, there being a small deficiency in the middle and a smaller 
excess in the tails of the distribution, the ACE distribution still 
seems to be approximated by the normal quite well. 
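
The "Normal" column of the comparison above can be reproduced by computing, for each multiple k of the standard deviation, the percentage of a standard normal population lying within k standard deviations of the mean. The sketch below does this with `statistics.NormalDist` standing in for Table III; the function name `percent_within` is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def percent_within(k):
    """Percentage of a normal population within k standard deviations of the mean."""
    return 100 * (std_normal.cdf(k) - std_normal.cdf(-k))

multiples = (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
normal_column = [round(percent_within(k), 1) for k in multiples]
print(normal_column)
```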

If, then, it is assumed that the ACE scores do essentially conform 
to a normal distribution, the substitution λ = (X − 96)/26 would 
convert the scores of Table 2.01 into standard normal measurements. 
The graph of their frequency distribution essentially would be 
Figure 4.23, the r.c.f. curve would be given approximately by Figure 
4.31, and Table III would present the distribution in tabular form. 
The statistical analysis of these data then might be more easily and 
efficiently accomplished than would otherwise be the case, and little 
or no important information would be lost in the process. 


PROBLEMS 

1. If a binomial frequency distribution has p = 1/4 and n = 80, calculate 
P(r > 25) by means of the normal approximation to this binomial distribution. 

2. Suppose that all the residents of a certain city definitely have made up 
their minds about a particular civic issue, and that 55 per cent favor one specific 
decision. What is the probability that on a random sample of 100 interviews 
less than 50 will favor this decision, that is, it will seem that the residents are 
against this decision when they actually favor it? Ans. .13. 

3. Suppose that an event E₁ occurs with a relative frequency p = 1/2, and 
that n random observations are to be made under these conditions. How large 
must n be before the number of occurrences, r, of E₁ will fall within one per 
cent of its mathematical expectation with probability equal to .10? That is, 
you must choose n so that P(n/2 − n/200 ≤ r ≤ n/2 + n/200) = .10. 

4. Suppose that a basketball team has established in previous games that it 
is safe to assume that the probability on each shot by a team member that a 
goal will be scored is .35. What is the probability that in a game in which they 
take 60 shots from the field they will hit less than 18 if the idealized 
assumptions just stated are good? Ans. .17. 

5. Suppose that a pair of unbiased dice are to be rolled 50 times. What is 
the probability that a 6 or a 7 or an 8 will appear on 20 to 25, inclusive, of these 
throws? 

6. According to certain records the average length of growing season at Man- 
hattan, Kansas, is 172 days. If the standard deviation about this mean is 13 
days, and if lengths of growing seasons in this area are normally distributed, 
what is the probability that the next growing season will be long enough to 
mature a crop which requires 190 days to complete its development? Ans. .084. 

7. Suppose that when wood blocks of a certain type, 2 by 2 by 8 inches, are 
tested for strength with the proper engineering equipment, their strengths are 
normally distributed with mean equal 13,000 pounds and standard deviation 
equal 3600 pounds. How many blocks out of 100 tested would you expect to 
have strengths below 6000 pounds? Between 10,000 and 15,000 pounds? 


8. If you are told that the heights of 10,000 college men closely follow a 
normal distribution with μ = 69 inches and σ = 2.5 inches: 

(a) How many of these men would you expect to be at least 6 feet in height? 
Ans. 1150. 

(b) What range of heights would you expect to include the middle 50 per 
cent of the men in the group? Ans. 67.3 to 70.7. 

9. Assuming that the wages of certain laborers in the building trades are 
normally distributed about a mean of $1.80 per hour with a standard deviation 
of 30 cents: 

(a) What proportion of the laborers receive at least one dollar per hour? 
(b) What range includes the middle two-thirds of these laborers' wages? 

10. Suppose that tests have indicated that certain silk fabrics have breaking 
strengths which are normally distributed about a mean of 27 pounds, with 
σ = 8; whereas materials with a mixture of silk and rayon have μ = 37 pounds 
and σ = 9. How likely is it that a piece of silk selected at random will be at 
least as strong as the average for the silk and rayon mixture? How likely is 
it that a randomly chosen piece of the silk-rayon mixture will be no stronger 
than the average for silk? Ans. P = .11, .13. 

11. Suppose that the persons whose ACE scores are in Table 2.01 are to be 
given letter grades on the assumption that these scores are normally distributed 
with μ = 95.7 and σ = 26.1. If 10 per cent are to get A's and 22 per cent D's, 
compute the score limits on each letter grade. 

12. Suppose that 52 per cent of the voters in a certain city are in favor of 
a particular one of the possible sites for a new high school. If 100 voters are 
to be selected at random, what is the probability that less than 50 per cent 
will vote in favor of this site? If the poll is so taken that 60 per cent of those 
who favor that site will not participate in the poll, what now is the probability 
that less than 25 per cent of a sample of 100 will vote for the site in question 
which 52 per cent of the voters actually favor? Ans. .31, .12. 


4.4 USE OF THE NORMAL DISTRIBUTION TO 
APPROXIMATE PROBABILITIES FOR A 
BINOMIAL FREQUENCY 
DISTRIBUTION 


Another important use to which the normal r.c.f. distribution can 
be put has been suggested previously, namely, the approximation of 
the summation of $C_{n,r}\,p^r q^{n-r}$ from r = a to r = b, when n is at all 
large and p is close to 1/2. It has been indicated that this sum is 
approximately equal to that area under the normal curve between 
the points X₁ = a − 1/2 and X₂ = b + 1/2. Moreover, it has been 
shown that the area under the normal curve between any two points 
along the X-axis can be obtained quite easily from an r.c.f. curve 
or from Table III. 

To illustrate this process and to indicate its accuracy, suppose 
n = 20 and p = q = 1/2, and that it is required to determine 
P(r ≥ 12). For this binomial distribution, μ = np = 10 and σ = 
√(npq) = √5; hence the normal distribution with these parameters 
will be employed in the approximation. Also, X₁ = 11.5, and X₂ = 
20.5. In terms of standard normal units, 

λ₁ = (11.5 − 10)/2.24 = +0.67, and 
λ₂ = (20.5 − 10)/2.24 = +4.69. 

By means of Table III and some interpolation it is found that 
approximately 25 per cent of a standard normal population has 
numbers between these λ-limits; hence P(r ≥ 12) = .25, approximately. 

Using the last column of Table VII from r = 12 on down, and 
using a divisor of 2²⁰ = 1,048,576, the exact probability, to 4 
decimals, that r will have some size from 12 to 20, inclusive, is found 
to be .2517. Certainly the normal approximation of .25 is excellent 
for most purposes, and the labor saved is considerable. 
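
The comparison just made can be sketched in a few lines of Python. The function names below are illustrative choices, and `statistics.NormalDist` takes the place of Table III; the half-unit adjustments to the limits are the ones described in the text.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_exact(n, p, a, b):
    """Exact probability that r lies from a to b, inclusive."""
    q = 1 - p
    return sum(comb(n, r) * p**r * q**(n - r) for r in range(a, b + 1))

def binomial_by_normal(n, p, a, b):
    """Normal-curve area between a - 1/2 and b + 1/2, with mu = np, sigma = sqrt(npq)."""
    curve = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return curve.cdf(b + 0.5) - curve.cdf(a - 0.5)

exact = binomial_exact(20, 0.5, 12, 20)
approx = binomial_by_normal(20, 0.5, 12, 20)
print(round(exact, 4), round(approx, 4))
```

The same two functions can be reused for other values of n, p, a, and b, for example in the problems of this chapter that ask for a normal approximation to a binomial probability.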
PROBLEMS 


1. Given a large number of college grades which follow a normal distribution 
with μ = 65 and σ = 10, what proportion of the grades would you expect to lie 
in the interval from 50 to 70, inclusive? 

2. Referring to Figure 4.31, how probable is it that 3 random selections from 
this population will each have λ's ≥ 2? Ans. P = .000027. 

3. What proportion of the measurements in a normal population would you 
expect to lie beyond X = 1.1 if μ = 0.5 and σ = 0.25? 

4. What proportion of the data described in problem 3 lies at least 0.15 unit 
from the arithmetic mean if the numbers are in an array? Ans. .55. 

5. Certain frost data collected in the neighborhood of Manhattan, Kansas, 
over a 69-year period indicate that the average date of the last killing frost 
in the spring is April 24, with a standard deviation of 10 days. Assuming a 
normal frequency distribution and assuming that the date of the last killing 
frost cannot be predicted from a current year's weather, what is the probability 
that the last killing frost next spring will come on or after May 1? 


4.5 STUDYING THE NORMALITY OF A FREQUENCY 
DISTRIBUTION BY RECTIFYING THE r.c.f. CURVE 


A method was given in a preceding section for detecting gross 
non-normality by calculating the proportions of a population lying within 
such intervals as μ ± kσ and comparing these with those proportions 
which are typical of a perfectly normal population of measurements. 
A graphic method will be presented in this section which will make 
it quite easy to compare the whole of a population with a standard 
normal population. The graphic method has these advantages: (1) 
Like other graphs, it utilizes the eye-mindedness of many persons. 
(2) It compares all the distribution with a standard normal instead 
of comparing a few segments such as μ ± 0.5σ, μ ± 1σ, etc. 
However, this graphic procedure has the disadvantage that it may 
encourage a hasty acceptance of the assumption that the given population 
is sufficiently near normal for the purposes at hand. More rigorous 
tests of normality exist in more mathematical textbooks, which 
can be consulted if the situation demands that additional care. It 
will be seen when the Central Limit Theorem is discussed in a later 
chapter that a considerable amount of non-normality can be 
tolerated in sampling studies; hence a precise (and laborious) test for 
normality is not often employed. In such situations a graphic test 
may be sufficiently reliable. 

The process of rectifying a curve y = f(x), which is the basic 
procedure of this section, is one of changing the scale of 
measurement of either, or both, x and y so that the new graph of y = f(x) 
becomes a straight line. That is, a curved line is straightened out by 
a change of scale. 

Because the reader is assumed to be familiar with logarithms, the 
description of a method for rectifying a normal curve will be 
preceded by a similar discussion regarding logarithmic and exponential 
curves. If Y = log₁₀ X, as in Table 4.51 for selected X's, the pairs 
of values (X, Y = log₁₀ X) plot on the curve of Figure 4.51. 


TABLE 4.51 
SOME PAIRS OF NUMBERS WHICH SATISFY THE EQUATION Y = log₁₀ X 

   X      Y          X      Y          X      Y 
   1    0.00       100    2.00       500    2.70 
   3    0.48       150    2.18       600    2.78 
   8    0.90       200    2.30       700    2.85 
  10    1.00       250    2.40       800    2.90 
  40    1.60       300    2.48       900    2.95 
  50    1.70       350    2.54      1000    3.00 
  70    1.85       400    2.60 

Figure 4.51. Graph of Y = log₁₀ X for X in the interval 1 ≤ X ≤ 1000. 
[Vertical axis: Y, from 0 to 4; horizontal axis: X, from 0 to 1000.] 


It is obvious that as the size of X increases, the size of Y = log₁₀ 
X increases less and less for equal increases in X. For example, 
when X changes from 500 to 600, log X changes by 0.08; but when 
X changes from 900 to 1000 (another increase of 100), log X 
changes by only 0.05. It is typical of straight-line (linear) 
mathematical relationships that Y changes the same amount for equal 
increases in X. In other words, Y changes uniformly with 
increasing X. If log X is put on a uniform scale, as in Figure 4.52, and 
the corresponding X's matched with their logarithms, the X-scale is 
what is called a logarithmic scale. Figure 4.53 shows the effect of 
graphing Y = log X against X when X is on a logarithmic scale. 


Figure 4.52. Matching of the logarithmic and the arithmetic scales of a 
measurement, X. [Upper scale: log X on a uniform scale, from 0.0 to 1.0; 
lower scale: the corresponding X, from 1 to 10.] 

Figure 4.53. Graph of Y = log₁₀ X when X is scaled according to the log₁₀ X 
as derived from Figure 4.52. [Horizontal axis: X on logarithmic scale, 
marked at 1, 10, 100, and 1000.] 


As can be seen, the graph is a straight line, and, for any equal 
distance along the horizontal axis, the Y changes by the same amount. 
It is noted that the X-axis falls into parts of equal length: one for 
numbers from 1 to 10, one for numbers from 10 to 100, and another 
for X's between 100 and 1000. This corresponds to numbers whose 
logarithms have characteristics of 0, 1, and 2, respectively. Graph 
paper with one scale logarithmic and the other arithmetic will be 
called semi-log paper. When it has three repeated sections along 
one axis (X-axis in Figure 4.53) it is called three-cycle semi-log 
paper. The three cycles correspond to any three successive 
characteristics of logarithms, that is, to numbers which fall between any 
three successive powers of 10. 

Figure 4.54 illustrates the use of semi-log paper to rectify an 
exponential curve. In this case $Y = 2e^{3X}$, but any base for the 
power could be used. Clearly log₁₀ Y = log₁₀ 2 + 3X log₁₀ e; or 
log₁₀ Y = 1.30X + 0.30, approximately. This will be a straight 
line if Y is measured on a logarithmic scale, as in Figure 4.54. Table 
4.52 gives the values used in plotting Figure 4.54. 
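
That taking logarithms straightens the exponential curve can be checked numerically. In the sketch below, which uses only the Python standard library and variable names chosen here for illustration, the X's are those listed in Table 4.52; the slope of log₁₀ Y between successive points comes out the same everywhere, which is the mark of a straight line.

```python
import math

# X's of Table 4.52 and the corresponding Y = 2e^(3X).
xs = [0, 0.25, 0.50, 0.75, 0.85, 1.00, 1.15, 1.25, 1.50, 1.75, 2.00]
ys = [2 * math.exp(3 * x) for x in xs]

# Measuring Y on a logarithmic scale amounts to plotting log10(Y) uniformly.
log_ys = [math.log10(y) for y in ys]

# Slope of log10(Y) between successive points; constant for a straight line.
slopes = [(log_ys[i + 1] - log_ys[i]) / (xs[i + 1] - xs[i])
          for i in range(len(xs) - 1)]

print(round(log_ys[0], 2), [round(s, 2) for s in slopes])
```

The intercept is log₁₀ 2, about 0.30, and every slope is 3 log₁₀ e, about 1.30, in agreement with the equation given above.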


TABLE 4.52 
VALUES OF $2e^{3X}$ FOR SELECTED X's 

   X      Y = 2e^{3X}        X      Y = 2e^{3X} 
  0           2.00          1.15       63.00 
  0.25        4.23          1.25       85.04 
  0.50        8.96          1.50      180.03 
  0.75       18.98          1.75      381.14 
  0.85       25.61          2.00      806.86 
  1.00       40.17 

Figure 4.54. Graph of $Y = 2e^{3X}$ on semi-log paper. [Vertical axis: Y on 
logarithmic scale, from 1 to 1000; horizontal axis: X, from 0 to 2.0.] 

With the foregoing introduction to the method of rectifying 
curves the same general process will be applied to the normal r.c.f. 
curve. As noted earlier, one of the questions which may arise in prac- 
tice is whether or not a given type of numerical measurement does fol- 
low a normal frequency distribution. Although the graphic proce- 
dure to be illustrated is definitely not a rigorous test for normality, 
it may be sufficient for practical purposes. 

The vertical scale to be employed will have what is called a 
normal r.c.f. scale marked off in whole percentages. The horizontal 
scale will be an arithmetic (or uniform) scale for the λ used 
previously in discussions of the standard normal frequency distribution. 
Figure 4.55 illustrates the process of obtaining the vertical scale in 
a manner which is analogous to that illustrated earlier for semi-log 
paper. It was constructed with the aid of Table III by plotting the 
normal r.c.f. as a percentage (top scale) directly over the 
corresponding λ, and then interpolating for the "integral r.c.f. per cent" 
found on the middle scale of Figure 4.55. This middle scale is the 
one to be used here in studying the approximate normality of 
frequency distributions. 


[Figure 4.55 shows three aligned scales: the normal r.c.f. in per cent
(from the table); the integral r.c.f. per cent, from 1 to 99 (scale for
the vertical axis); and standard normal units, λ, from −2 to +2 (scale
for the horizontal axis).]

Figure 4.55. Determination of the scales for a normal-arithmetic graph.


As is to be expected after the discussion of semi-log graph paper, 
it is not necessary to go through the work back of Figure 4.55 be- 
cause graph paper already exists on which we can do this graphing. 
Figures 4.56A and B were constructed on normal-arithmetic paper
to illustrate the way the normality or the non-normality of a
distribution affects a graph on such paper. Four distributions are
employed in these illustrations:


(a) Truly normal distribution of Table III;

(b) ACE scores of Table 2.01;

(c) the data on farm acreages in Table A (below); and

(d) the definitely non-normal distribution of Table 4.53.


TABLE 4.53
A FICTITIOUS NON-NORMAL FREQUENCY DISTRIBUTION

Class Interval        f     r.c.f.      λ
 55 to  59.9          1     100.0      3.9
 50 to  54.9          2      99.8      3.5
 45 to  49.9          5      99.5      3.2
 40 to  44.9          8      98.6      2.8
 35 to  39.9         10      97.2      2.43
 30 to  34.9         12      95.4      2.07
 25 to  29.9         15      93.3      1.70
 20 to  24.9         20      90.7      1.34
 15 to  19.9         40      87.1      0.98
 10 to  14.9         60      80.1      0.61
  5 to   9.9         80      69.5      0.25
  0 to   4.9        100      55.5     −0.11
 −5 to  −0.01        90      37.9     −0.48
−10 to  −5.01        80      22.0     −0.84
−15 to −10.01        30       7.9     −1.21
−20 to −15.01        10       2.6     −1.57
−25 to −20.01         5       0.9     −1.94


[Figures 4.56: (A) the normal distribution of Table III and the ACE
scores; (B) the farm acreages and the distribution of Table 4.53.
Vertical axis: percentage on the normal r.c.f. scale; horizontal axis:
−2 to +2.]

Figures 4.56. Some graphs of r.c.f. distributions on normal-arithmetic paper.
It should be evident from Figures 4.56 that the following are true:


(a) The frequency distribution from Table III yields a perfectly
straight line when r.c.f. as a percentage is plotted against λ on
normal-arithmetic graph paper.

(b) The frequency distribution of the ACE scores apparently is
quite near to normal because the points of their r.c.f. graph on
normal-arithmetic graph paper appear to deviate only slightly from
a straight line.

(c) The distribution of the farm acreages in Ness County, Kansas,
is essentially normal except that the lower end of the distribution
is missing, that is, the distribution is truncated.

(d) The fictitious distribution of Table 4.53 clearly is not normal
because the points of the r.c.f. graph definitely do not follow
a straight line on normal-arithmetic paper.
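
The same comparison can be made without special graph paper. The
following Python sketch is an illustration only, not the book's
procedure: it assumes that the height of a percentage on the normal
r.c.f. scale is the standard normal unit whose r.c.f. equals that
percentage, and it obtains that unit from the standard library's
NormalDist class. For a normal distribution this height would equal the
tabulated λ; for Table 4.53 the two differ markedly.

```python
from statistics import NormalDist

# (r.c.f. in per cent, lambda) pairs copied from Table 4.53. The top class
# (r.c.f. = 100.0) is left out: 100 per cent has no finite position on a
# normal r.c.f. scale.
table_4_53 = [
    (99.8, 3.5), (99.5, 3.2), (98.6, 2.8), (97.2, 2.43), (95.4, 2.07),
    (93.3, 1.70), (90.7, 1.34), (87.1, 0.98), (80.1, 0.61), (69.5, 0.25),
    (55.5, -0.11), (37.9, -0.48), (22.0, -0.84), (7.9, -1.21),
    (2.6, -1.57), (0.9, -1.94),
]

std_normal = NormalDist()  # mean 0, standard deviation 1

def normal_scale_position(rcf_percent):
    """Standard normal unit whose r.c.f. equals the given percentage."""
    return std_normal.inv_cdf(rcf_percent / 100.0)

deviations = []
for rcf, lam in table_4_53:
    position = normal_scale_position(rcf)
    deviations.append(position - lam)
    print(f"r.c.f. = {rcf:5.1f}%   lambda = {lam:5.2f}"
          f"   normal-scale position = {position:5.2f}")

print("largest departure:", round(max(abs(d) for d in deviations), 2))
```

Departures of several tenths of a standard normal unit at both ends of
the table correspond to the visible bending of the plotted points away
from a straight line.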


It should be noted with respect to the conclusions above that only 
gross (and hence certainly serious) non-normality will show up 
under this sort of scrutiny. A look at the frequency distributions
associated with (b) and (c) above shows that there certainly is 
some lack of normality. Figures 4.56 show this clearly; but whether 
or not the relative departure from a straight line is negligible will 
depend on the particular circumstances. Discussion to be given 
in Chapter 6 will be helpful in this decision. 


PROBLEMS 


1. Plot the following pairs of values of X and Y as points on a graph, using
semi-log paper with Y measured on the logarithmic scale. Then determine the
slope of the straight line through the points and relate it to the way Y changes
per unit increase in X.

X: 1, 2, 3, 4, 5, 6
Y: 2, 6, 18, 54, 162, 486.


2. Plot $Y_1 = \log_{10} X$ and $Y_2 = 5 \log_{10} X$ on the same sheet of
arithmetic graph paper and also on the same sheet of semi-log paper. What
effect does the coefficient 5 have on these graphs?

3. Plot the r.c.f. curve for $y = \frac{1}{5\sqrt{2\pi}}\, e^{-(x - \ldots)^2/\ldots}$
on normal-arithmetic graph paper.

4. Plot the following tabular r.c.f. distribution on normal-arithmetic paper and
comment on any lack of normality revealed by your graph.

Class Interval     r.c.f.
80-82.99           1.00
77-79.99            .90
74-76.99            .74
71-73.99            .50
68-70.99            .40
65-67.99            .33
62-64.99            .26
59-61.99            .20
56-58.99            .16
53-55.99            .13
50-52.99            .09
47-49.99            .07
44-46.99            .05
41-43.99            .02


5. Make an r.c.f. distribution for the fly counts of problem 1, section 2.4.
Plot this distribution on normal-arithmetic paper, and discuss any apparent
non-normality of this distribution.

6. Perform the operations required in problem 5 for all the birth weights of
female guinea pigs listed in Table 2.61.

7. Perform the operations required in problem 5 for all the birth weights of
male guinea pigs listed in Table 2.61.

8. Perform the operations required in problem 5 for all the 4-day gains of
female guinea pigs listed in Table 2.62.

9. Perform the operations required in problem 5 for all the 4-day gains of
male guinea pigs listed in Table 2.62.

10. Perform the operations required in problem 2 for logarithms to the
natural base e rather than 10, and comment on the effect of this change on the
graph.


REVIEW PROBLEMS


1. If you are among 1000 persons, each of whom purchases a one-dollar lot- 
tery ticket for a prize of $1000, what is the expected value of your ticket in 
the mathematical sense? 

2. Determine the expected frequencies of sums of 3, 4, 5, and 6, respectively,
when three unbiased dice are thrown simultaneously 1000 times.

Ans. 4.6, 13.9, 27.8, 46.3.

3. If 25 pennies and 15 dimes are placed in individual envelopes, thoroughly
mixed, and presented to you for the selection of one envelope, what is the
probability that you will get a dime? What is your mathematical expectation
on such a draw?

4. How many two-digit numbers can you make up by selecting any number 
from 1 to 9, inclusive, for each digit? How many numbers could you form if 
none is to contain the same digit twice? Ans. 81, 72. 

5. Suppose that a turtle is hatched at point A and then wanders over a uniform
terrain in search of food. If he never wanders more than 1000 yards
radially from spot A, and if he moves over the area in such a way that he is
equally likely to be on any preassigned areas of a specified size, what is the
probability that he will be within a circular area of 100 square yards whose
center is 300 yards from, and northeast of, the spot A?

6. Table A (below) and Figure A present the distributions of the various sizes
of farms in Ness County, Kansas. If a stratoliner were to drop a package by
parachute so that it will be sure to land on Ness County, but the pilot cannot
tell where, what is the probability that it will fall on a farm of more than
1000 acres if 10 per cent of the county is not in farm land and that 10 per cent
is uniformly distributed over the county? Ans. .30.


7. If 100 farmers are to be selected from Ness County without knowledge of
the areas of their farms, and if one supposes one farmer per farm, what is the
mathematically expected number of representatives from farms covering less
than 500 acres? What fraction of the county’s farm acreage do they represent?

8. Determine graphically the lower limit of the sixtieth percentile and of
the third decile for the data of Table A. Ans. About 520 acres; 260 acres.

9. Table B presents a summary of the years of schooling had by all legal
residents of Kansas who were 25 years of age or older on April 1, 1940.
Construct what appears to you to be a good graphic presentation of these data.

10. If a roving reporter were to go all over Kansas, impartially asking persons
their opinions on a certain educational matter, what proportion of his
interviews would you expect to be with persons who have had at least two years
of college education if he talked only to persons who were at least 25 on
April 1, 1940? What percentage would have no college education?

Ans. 9 per cent, 88 per cent.

11. If an insecticide is known to be 99 per cent lethal to a certain species of 
insect, what is the probability that less than 5 will survive if 150 selected at 
random are sprayed with this spray? 

12. Suppose that a particular variety of apple grown under specified conditions
produces yields (per tree) which are normally distributed with μ = 8
bushels and σ = 2.5 bushels. What is the probability that a randomly chosen
tree will be found to yield less than 5 bushels? That two such trees will each
be found to yield less than 5 bushels? Ans. .12, .014.

13. Obtain records like those in Table B from the latest census, make an r.c.f.
graph for those data, and determine the median years of schooling. Compare
with the median for 1940 and draw any appropriate conclusions. Use college
years as 13, 14, 15, 16, and 17.

14. Plot the r.c.f. distribution of Table A on normal-arithmetic paper and
comment on the apparent normality, or lack of it, for this distribution of farm
sizes.


[Figure A plots two curves against acres (1 to 2001): the relative
cumulative frequency, read on the left-hand scale, and the proportion of
total acreage, read on the right-hand scale. Both scales run to 1.00.]

Figure A. Two types of relative cumulative frequency distributions for the
sizes of farms in Ness County, Kansas, in 1940.


TABLE A. DISTRIBUTION OF FARM ACREAGES IN THE WHOLE OF
NESS COUNTY, KANSAS

(Data furnished by W. H. Pine, Department of Economics and Sociology, Kansas
State College.)

Acreage        Frequency,    Frequency,    Percentage of
Interval           f           r.c.f.      Total Acreage
Over 2000          28           1.00           1.00
1801-2000           7            .98            .88
1601-1800          14            .97            .86
1401-1600          21            .96            .83
1201-1400          28            .94            .78
1001-1200          35            .92            .73
 801-1000          77            .89            .67
 601- 800         194            .83            .57
 401- 600         236            .67            .37
 201- 400         320            .48            .21
   1- 200         273            .22            .06
Total            1233

μ = 550 acres                md = 420 acres


TABLE B. YEARS OF SCHOOLING COMPLETED BY KANSAS RESIDENTS AT LEAST
25 YEARS OLD ON APRIL 1, 1940

Years Completed          Number, f
None                        11,975
Grade 1                      2,136
      2                      5,507
      3                     14,833
      4                     29,745
      5                     33,628
      6                     45,722
      7                     54,326
      8                    388,512
High School
Year  1                     62,173
      2                     61,935
      3                     31,315
      4                    173,580
College
Year  1                     29,113
      2                     32,374
      3                     12,973
      4                     35,947
At least 5                  12,580

Total                    1,037,774


CHAPTER 5


Sampling from
Binomial Populations

When a population of numerical measurements involves so much
data that it is either impossible or unwise to attempt to analyze the
whole of it, sampling must be relied upon to furnish the desired
information. As a matter of fact, most of the statistical analyses
now performed involve sampling data. A multitude of examples
could be cited to illustrate the need for sampling, but the following
will suffice for the purposes of this discussion.


(5.01) Public opinion polls. Only a small percentage of the persons
eligible for interview actually are questioned about the matter
under study. The sole objective of the study is to estimate the
proportions of the citizens favoring the various points of view. If the
question to be asked has only a yes or a no answer the results of the
poll will constitute a sample from a binomial population, and we
would be attempting to estimate p.

(5.02) A study of the toxicities of two insecticides conducted by
spraying insects of a certain species with the insecticides and counting
the dead insects. This is another case of sampling a binomial
population; but the purposes of the investigation may be different.
The following question is to be answered: Is one of the sprays more
toxic to these insects than the other? Statistically, the question
becomes: Is it reasonable to suppose that the two sets of data obtained
with the two sprays are samples from the same binomial
population? Of course, such a study also may include the estimation
of p as mentioned in (5.01).


(5.03) Testing the breaking strengths of concrete columns, of
wood or of metal beams, and of other engineering materials. Breaking
strengths are measured on a continuous scale of numbers; hence
their populations have continuous frequency distributions. Problems
of this sort include the estimation of true average breaking strengths,
and comparisons of the strengths of different materials.

(5.04) Studies involving two variables such as prices of selected 
stocks and the volume of production of finished steel, ACE score 
and grade average in college, stand counts of wheat and the yield of 
a plot, etc. In such investigations it would be necessary to estimate 
from sampling data the relationship between the two variables, ex- 
press it mathematically, and then use it in accordance with the pur- 
poses of the investigation. 


It can be seen from the examples above that two general types of 
statistical problems must be considered in sampling studies. One is 
to derive from the sample observations some numbers which can 
be used satisfactorily in place of one or more unknown population 
parameters. These numbers which will be derived from the sample 
are called sampling estimates of the parameters. They are change- 
able from sample to sample and, being dependent upon chance events, 
are subject to the laws of probability. 

The other general problem is to test hypotheses regarding populations
against actual sample evidence. For example, if the populations
of the breaking strengths of two types (different shapes, for
example) of concrete columns each follow a normal frequency
distribution with the same variance, $\sigma^2$, these populations can differ
only in their means $\mu_1$ and $\mu_2$. That is, it is supposed that the
engineers in charge are satisfied that the two types of columns have
the same uniformity of performance from test to test, but it is yet
to be decided whether they have the same average strength. If so
(that is, if $\mu_1 = \mu_2$), the populations of breaking strengths are
identical normal populations. It then becomes a problem of deciding
from samples taken from each population whether or not $\mu_1$ is in
fact equal to $\mu_2$. It usually is convenient statistically to assume
that $\mu_1$ does equal $\mu_2$, and then to see how reasonable this hypothesis
is in the light of sample observations.

It should be clear—intuitively, at least—that decisions based on 
samples may be in error, and that we do not know in any particular 
case if our sample is so unusual that it is misleading us. How, then, 
can sample evidence become a satisfactory basis for making decisions 
about populations? The answer lies in the fact that, while no one 
can say whether a particular decision is right or wrong, it is possible 
to determine the relative frequency with which correct decisions will 
be made over the long run of much experience if we are following
certain rules for acting upon the basis of sampling evidence. It
follows that the probability of making a correct decision from any 
specified future sample (say the next one we are going to take) also 
can be stated. 

To illustrate some of the preceding discussion, suppose you are 
about to engage in a coin-tossing game in which “heads” is the 
event which is of particular interest to you. Assume, also, that you 
are not satisfied that the coin is unbiased but are not going to worry 
about bias unless the probability of heads is as low as 1/3. Before 
playing the game you are going to flip the coin 15 times and then 
come to a decision regarding the bias of the coin. What rules for 
action should you adopt and how effective will they be in detecting 
bias as bad as p = 1/3? It is being assumed that you are not enter- 
taining the possibility of bias toward too many heads. 

As long as the coin has two sides and one is heads, the other tails, 
any result from 0 to all 15 heads can occur on 15 flips regardless of 
bias in the coin. However, it should be clear that the relative fre- 
quencies of occurrence of the 16 possible results are dependent upon 
the size of p. For p = 1/3, for example, such a result as 15 heads 
on 15 throws is an extremely rare occurrence. The actual rarity, 
in terms of probability, can be derived from the binomial series for 
$(q + p)^n$, with p, q, and n given.

If p = 1/2 and n = 15, the binomial series is

$(1/2 + 1/2)^{15}$ = .000 + .000 + .003 + .014 + .042 + .092 + .153
               r:     0      1      2      3      4      5      6

                    + .196 + .196 + .153 + .092 + .042 + .014 + .003
               r:     7      8      9     10     11     12     13

                    + .000 + .000.
               r:    14     15

When p is unknown and a sample has produced r = 0, 1, 2, or 3 heads
on 15 random flips, you probably would be very reluctant to accept
the hypothesis, $H_0$(p = 1/2), because the total probability of the
occurrence of one of these 4 mutually exclusive events is but .017,
or about 1 chance in 59. Although it is true that one of those 4
results can be obtained when the coin is unbiased (and you knew
this before you tossed the coin 15 times), you are now faced with
the necessity to decide if the coin is biased or not, and you must do
so upon the basis of the sample's evidence. If you decide to reject
the hypothesis that p = 1/2 whenever the observed number of heads
is one of the 4 cases just listed, you will unjustly reject $H_0$ 1.7 per
cent of the time because that is how frequently such cases occur by
chance when p does equal 1/2. Nevertheless, some rules for action
must be adopted or else nothing can be decided from samples. Hence,
it will be supposed that the following rules will be followed after
15 sample tosses of the coin in question:


(a) If r = 0, 1, 2, or 3 heads, you will reject $H_0$(p = 1/2) and
assert that the coin is biased against heads.

(b) If r ≥ 4, you will accept $H_0$ and play the game on the assumption
that the coin is not biased against heads.
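
The terms of the binomial series for p = 1/2, and the chance that rule
(a) rejects a true hypothesis, can be reproduced with a few lines of
Python. This is only an illustrative sketch; the function name
binomial_series is ours.

```python
from math import comb

def binomial_series(n, p):
    """Terms of (q + p)^n: probability of exactly r heads, r = 0, ..., n."""
    q = 1 - p
    return [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

terms = binomial_series(15, 0.5)
for r, prob in enumerate(terms):
    print(f"r = {r:2d}   probability = {prob:.3f}")

# Rule (a) rejects when r = 0, 1, 2, or 3.
prob_reject_when_unbiased = sum(terms[:4])
print(f"P(r <= 3 when p = 1/2) = {prob_reject_when_unbiased:.4f}")
```

The exact sum is a little under .018; adding the four terms after they
have been rounded to three decimals gives the .017 used in the text.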


These two rules can lead you to correct conclusions and actions,
and they also can cause you to make one of two kinds of errors:


(1) The hypothesis $H_0$(p = 1/2), which is being tested by sampling,
may be rejected when it is true. This will be called an error
of the first kind. In the above example, the probability that such
an error would occur was noted to be .017 under the rules a and b.

(2) The $H_0$ may be accepted when it is false. This will be called
an error of the second kind. It should be clear that the likelihood
of committing an error of this kind depends on what possibilities,
or alternative hypotheses, there are.


The probability of an error of the first kind will be designated by α.
For a chosen α, it is desirable to adopt rules for which the probability
that an error of the second kind will occur (designated by β) is as
small as possible under the circumstances.

Referring back to the coin-tossing problem, we see that α = .017.
Also, the person who was trying to decide from 15 throws if the coin
was seriously biased would not care if p had some size between 1/2
and 1/3, but did wish to detect a p as low as 1/3. Hence the alternative
hypotheses whose truth could lead to errors of the second kind
include all p's at or below 1/3. For the sake of simplicity it will be
supposed that the only alternative hypothesis to $H_0$(p = 1/2) is
that p = 1/3.

The β can be determined from the following series:

$(2/3 + 1/3)^{15}$ = .002 + .017 + .060 + .130 + .195 + .214 + .179
               r:     0      1      2      3      4      5      6

                    + .115 + .057 + .022 + .007 + .002 + .000 + .000
               r:     7      8      9     10     11     12     13

                    + .000 + .000.
               r:    14     15

Hence if p actually is 1/3 so that the hypothesis of no bias should
be rejected, the probability is .002 + .017 + .060 + .130 = .209 that
$H_0$ will be rejected. Or the probability that $H_0$ will not be rejected
when it should be (an error of the second kind) is β = 1 − .209 =
.791. Obviously, the rules a and b would not be good ones if it is
serious to fail to detect the bias indicated by p = 1/3. However, if
the most serious mistake is to accuse someone of employing a biased
coin when he is innocent, rules a and b may be quite satisfactory.
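
How α and β trade against each other can be seen by varying the largest
number of heads that leads to rejection. The Python sketch below is an
illustration under the same assumptions as the text (n = 15, hypothesis
p = 1/2, single alternative p = 1/3); the cutoffs other than 3 are our
own additions for comparison and are not rules proposed in the text.

```python
from math import comb

def prob_at_most(k, n, p):
    """Probability of k or fewer heads in n flips when P(heads) = p."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n = 15
results = {}
for k in range(2, 7):                      # rule: reject H0 when r <= k
    alpha = prob_at_most(k, n, 1 / 2)      # error of the first kind
    beta = 1 - prob_at_most(k, n, 1 / 3)   # error of the second kind
    results[k] = (alpha, beta)
    print(f"reject if r <= {k}:   alpha = {alpha:.3f}   beta = {beta:.3f}")
```

The line for r ≤ 3 corresponds to rules a and b. Raising the cutoff makes
β smaller only at the price of a larger α.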

In practice we seldom can compute β as simply as above. Usually
the α is set at an appropriate level and then standard tests are employed
without actually knowing the β. However, it can be said
here that the tests to be discussed in this, and the next, chapter have
been chosen with the idea of making the β as small as possible under
the circumstances and for the chosen α.

As the heading of this chapter indicates, the subsequent discussion 
will be confined to samples from binomial populations. Later chap- 
ters will take up the normal and the two-variable situations. 


5.1 OBTAINING THE SAMPLE


Before a method for obtaining the sample is devised, the popula- 
tion which is to be sampled must be defined clearly. It is recalled 
from Chapter 4 that a binomial population is possible only if the 
units in some definable group have attributes which may be described 
by just two classes. Moreover, the fractional part of the population 
falling into each class must stay fixed. For example, all the farmers 
in Finney County, Kansas, on July 1, 1953, could be classified un- 
ambiguously into two classes as regards membership in some co- 
operative association: those who do belong to some cooperative and 
those who do not belong to any such association. The units would
be the individual farmers’ designations as member or as non-member.
The fraction of the total number of farmers in that county who were
classed as members could be the parameter p of Chapter 4. Then
1 − p would be the fraction who were classified as non-members on
the stated date.

It is entirely possible for those same farmers to be the basis for 
other populations. Their answers for or against a proposed new 
federal farm policy could constitute another binomial population 
of interest, their per-acre incomes during a specified period could 
be another (non-binomial) population, and the sizes of their families 
on July 1, 1953, could be still another (non-binomial) population 
which might be of interest to some group of persons. 

The chief criterion of a good definition of a population which is 
about to be sampled is that it make entirely clear in all important 
respects the larger group of units to which the conclusions drawn 
from the sample will pertain. 

Given a well-defined population, the sample obviously must be
taken in such a manner that the impression it produces through
statistical analyses will have the greatest possible chance to be
accurate and dependable. Naturally the facilities and economic
resources available for the sampling may be limiting factors; but it
will be assumed in the discussion to follow that those resources and
facilities are at least good enough to justify undertaking the sampling
study at all. For purposes of illustration, suppose that we wish
to determine public opinion in a large city regarding a political issue
of current interest, and that our resources allow us to interview only
[...] How should this [...] the few major [...] number of interviews
[...] persons are concentrated [...]; but we have no [...] would meet
there are the same as those we would find in the outlying districts,
for example. [...] directory until we thought [...] have telephones.
If we are [...] drawing numbers at random until a sample of the
desired size was obtained. Such a procedure would make it true that
every possible sample of the size n had been equally likely to be
drawn at the outset of the sampling, and this is necessary in random
sampling.
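
Drawing numbers at random is easily imitated on a computer. The Python
sketch below is an illustration only: it assumes that every member of
the population has been assigned one number from 1 to N, and the values
of N, n, and the seed are hypothetical. The standard library's
random.sample draws without replacement, so that every possible set of
n numbers is equally likely.

```python
import random

def draw_random_sample(population_size, n, seed=None):
    """Return n distinct numbers drawn at random from 1..population_size."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), n))

# Hypothetical figures: 50,000 numbered persons, 400 interviews.
chosen = draw_random_sample(population_size=50_000, n=400, seed=1953)
print(chosen[:10])   # the first few numbers selected for interview
```

The persons holding the chosen numbers would then be interviewed,
whoever and wherever they happen to be.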

In many circumstances the procedure of sampling just outlined
would be unsatisfactory. A city might be made up of racial and
economic groups of such diverse opinions on the matter being studied
that it would not do to leave their representation in the sample to
pure chance, as in the random sampling just described. It would be
necessary to sample each group in accordance with its proportionate
part of the city's total registration of voters.
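
Sampling each group in accordance with its proportionate part can be
sketched as follows. This Python example is a hypothetical illustration:
the group names and registration figures are invented, and the rule used
to settle fractional interviews (giving the leftover interviews to the
groups with the largest fractional parts) is one common choice, not one
prescribed by the text.

```python
def proportional_allocation(group_sizes, n):
    """Split n interviews among groups in proportion to their sizes."""
    total = sum(group_sizes.values())
    exact = {g: n * size / total for g, size in group_sizes.items()}
    allocation = {g: int(share) for g, share in exact.items()}  # round down
    leftover = n - sum(allocation.values())
    # Hand the remaining interviews to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - allocation[g],
                         reverse=True)
    for g in by_fraction[:leftover]:
        allocation[g] += 1
    return allocation

# Invented registration figures for four groups of voters.
registered = {"group A": 41_830, "group B": 31_470,
              "group C": 18_420, "group D": 8_280}
allocation = proportional_allocation(registered, n=400)
print(allocation)
```

Within each group the persons to be interviewed would still be chosen at
random.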

It is noted that the sampling discussed above has supposed that the 
sample will be taken by means of personal interviews. Any such
systems as calling persons on the telephone or mailing questionnaires,
which depend on voluntary and selective responses, or on their being 
at a certain place at a certain time, are almost certain to produce 
biased samples. The cause of their not responding, and hence not 
being in the sample, may be associated with the type of response 
they would have given. 

The theory and techniques of sampling in such a way that the
conclusions which can be drawn therefrom will be accurate and
reliable are very extensive and cannot be covered here. The remarks
above merely point out a few of the more important and general
considerations. However, the reader can be warned to be critical
of any conclusions drawn from samples until he is satisfied that the
samples were taken in such a way that they should be representative
of the population about which conclusions have been drawn. If one
brand of cigarettes is said to be the favorite of a certain professional
group, we should at least wonder if that group was properly sampled.
Or, if someone returns from a foreign country and asserts that the
residents of that country hold certain points of view regarding a
matter of world-wide interest, we should wonder if he did an adequate
job of sampling public opinion in that country. Or, as a final
example, if someone seeks to obtain a sample by means of a mailed
questionnaire, we should wonder if those who do not respond have a
different opinion, say, from that generally expressed by those who
did return their questionnaires. If so, what population did those
who returned their questionnaires represent?


PROBLEMS


1. Suppose that you were sent out to ascertain the public opinion in a certain
community regarding the necessity for flood control of a certain type in
that area. How would you obtain your sample so that it would be representative
of the whole community?

2. Referring to problem 1, would it make any difference in the manner in 
which you took your sample if it were taken in July of 1936 during a severe 
drouth or in July of 1951 right after a record flood? Justify your answer. 

3. Suppose that a roving reporter goes into a city with the intention of 
ascertaining public opinion on a matter of foreign policy. He is going to walk 
about the streets asking persons at random a specific question requiring one
of three answers: Yes, No, or No Opinion. Will it make any difference what 
hours of the day, between 7 a.m. and 6 P.M., he does this? Would the day of 
the week matter? Would the type of city—industrial, college site, farming 
community, rich suburb, and the like—have anything to do with the answers 
to these questions? 

4. Suppose that a company which is manufacturing candies develops a new
product which its originators believe is especially good. Which of the following
possible ways of testing the public’s reaction to this new confection would you
prefer to use? Why?


(a) Sit back and see how the sales go.

(b) Have some trained persons take samples out to the public and ask people
to taste the candy, to record their reactions, and to give these records to
the field representatives directly.

(c) In each of the first 10,000 packages manufactured, place a stamped and
addressed card requesting that the purchaser record his opinion of the candy
and mail the card to the company.

(d) Ask a panel of expert candy tasters to decide the matter.

(e) Do as in d, first, then a.

(f) Do as in d, first, then b.

(g) Do as in d, first, then c.

(h) Have all the firm's employees record their opinions of the candy and
decide from these records if mass production is wise.

(i) Combine h and a.

(j) Combine h and b.

(k) Combine h and c.

(l) Combine others above. Specify.

(m) Specify another method if you have one you prefer.


5. Suppose that some engineers [...] strength of two
types of structural beams, each [...] a different company.
Which of the following [...] you recommend if
the engineering group [...] operating personnel,
available for the tests?


(a) Ask each company to send a specified number of beams for testing and
have each laboratory test half of each company's product.

(b) As in a, but have one laboratory test all one company's beams, the second
laboratory testing all the second company's beams.

(c) Go into the public market and purchase the necessary number of beams
of each type, and then do as in a.

(d) As in c, but replace a by b.

(e) Specify other ways.


6. An agronomist wishes to run critical yield, protein, and test weight studies 
on a proposed new variety of wheat before the variety is released to the public. 
He proposes to use a standard and widely planted variety for comparison with 
the new one. Plenty of land is available for this study, but it is quite non- 
uniform in soil qualities, moisture content, and exposure to weather. Which 
of the following outlines for such a study would you prefer, and why? 


(a) Plant the new variety on the east half of the available land, the standard 
variety on the west half (or vice versa, as decided by flipping a coin), harvest 
and measure wheat from each half, determine test weight and protein content 
on the yield from each half separately. 

(b) Divide the available area into 20 equal-sized plots and plant 10 plots to 
each variety, choosing the variety for a plot by drawing the names from a hat. 
Then determine yield, protein, and test weight separately from each plot’s
wheat.

(c) Do as in b, except that the plots are grouped into 10 pairs and each pair 
has both varieties planted side by side. 

(d) Save the land for some other purpose, send out samples of each wheat
to 10 farmers, and ask them to report the yields and test weights and send in
samples for protein analysis.


5.2 CALCULATION OF POINT AND INTERVAL ESTIMATES
OF p FOR A BINOMIAL POPULATION


It was indicated in Chapter 4 that a binomial frequency distribution
can be defined when individuals are identified only as belonging
to one of two possible classes of attributes such as male or female,
dead or alive, acceptable product or unacceptable product, and the
like. Moreover, the proportions falling into the two classes of
attributes are constantly p : (1 − p).

If n members of a binomial population are selected at random,
the particular individuals drawn are the result of chance occurrences.
Hence, we may find that any number from r = 0 to r = n of those
individuals possess the attribute A, say, even though a fixed proportion,
p, have that attribute in the whole population. The possible
outcomes of such a sampling vary from r = 0 to r = n and form a
binomial frequency distribution with mean $\mu = np$ and with standard
deviation $\sigma = \sqrt{npq}$, as was shown in Chapter 4. The reader is
reminded that the probability that exactly r of the n members of
the sample will possess an attribute with probability of occurrence
= p for any specified future trial from the population is given by
the formula: $C_{n,r}\, p^r (1 - p)^{n-r}$.
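
The formula and the two summary constants can be verified for any
particular n and p. The Python sketch below is illustrative only; the
choice n = 15, p = 1/3 is ours. It computes the probability of each
possible r from the formula and then checks that the mean and the
standard deviation of the resulting distribution agree with np and the
square root of npq.

```python
from math import comb, sqrt

def binomial_probability(r, n, p):
    """C(n, r) p^r (1 - p)^(n - r): probability of exactly r successes."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 15, 1 / 3          # illustrative values, not taken from the text
q = 1 - p

probs = [binomial_probability(r, n, p) for r in range(n + 1)]
mean = sum(r * pr for r, pr in enumerate(probs))
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(probs))

print(f"mean from the distribution = {mean:.4f}   np = {n * p:.4f}")
print(f"s.d. from the distribution = {sqrt(variance):.4f}"
      f"   sqrt(npq) = {sqrt(n * p * q):.4f}")
```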

The point of view in the preceding paragraph is that of Chapter 4 
in which the size of p was assumed to be known. More commonly, p 
is not known and we have only a sample estimate of its size. This 
estimate is r/n, which varies under repeated sampling from 0 to 1. 
Even though r/n is a variable quantity, useful and reliable conclu- 
sions can be drawn from samples taken from a binomial population, 
as will be shown shortly. Three types of such conclusions will be 
considered in this chapter: (a) Given a sample, what can we say 
about the size of p? (b) Given a sample from a binomial distribu- 
tion, how well does it agree with a predetermined hypothesis con- 
cerning the magnitude of the p for that population? (c) Given two 
random samples, did they probably come from the same binomial 
population? The present section is concerned with question a. 

When the true proportions of the two types of members of a 
binomial population are not known, they can be estimated by means 
of a sample, as suggested above. This estimation can take either 
of two forms: (a) a point, or specific, estimate of p, which would be 
used in lieu of the p, or (b) an interval estimate which would have 
a preassigned probability of bracketing the size of p. This latter 
process is called placing a confidence interval on p. The confidence 
we can have that the bracket, or interval, does actually include the 
unknown parameter is described by the confidence coefficient. 

Statistical research indicates that the best point estimate of p is
obtained from $\hat{p} = r/n$, the observed fraction of the sample which
possesses the particular attribute that is being studied. Some of the
reasons for this decision are:

(5.21) The $\hat{p}$ has an expected value
$E(\hat{p}) = E(r/n) = E(r)/n = np/n = p$ for any particular sample
size, n. That is, the long-run average size of $\hat{p}$ is exactly equal
to the true population parameter p. It is customary to call point
estimates unbiased estimates if their mathematical expectation is the
parameter which is being estimated. We generally prefer to employ
unbiased estimates, like $\hat{p}$, unless some more important property
is missing.

(5.22) The estimate $\hat{p} = r/n$ has a variance = pq/n because the
variance of r is npq, as shown in Chapter 4, and the effect of dividing
the r by n is to divide the variance by $n^2$, as was shown in the
section of Chapter 2 which dealt with the coefficient of variation. This
variance, pq/n, of the estimate $\hat{p}$ will be quite small for a
sample of almost any useful size because the p and the q are each less
than unity. This indicates that $\hat{p}$ will not vary greatly from
sample to sample, especially if the sample size is fairly large. As a
matter of fact, the size of the variance of $\hat{p}$ can be made as
small as desired by taking the n sufficiently large. Hence, this
estimate, $\hat{p}$, is considered to be a very efficient estimate of p.
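Statements (5.21) and (5.22) can be verified by weighting every possible value of r/n by its binomial probability. The sketch below is an illustration only; the function name `estimate_moments` and the choice p = .9 are assumptions made here. It computes the exact mean and variance of the estimate for three sample sizes and prints pq/n beside the variance for comparison.

```python
from math import comb

def estimate_moments(n, p):
    """Exact mean and variance of r/n over all possible samples of size n."""
    weights = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
    mean = sum((r / n) * w for r, w in enumerate(weights))
    variance = sum((r / n - mean) ** 2 * w for r, w in enumerate(weights))
    return mean, variance

p = 0.9  # illustrative proportion, assumed for this sketch
for n in (10, 50, 250):
    mean, variance = estimate_moments(n, p)
    print(n, round(mean, 6), round(variance, 6), round(p * (1 - p) / n, 6))
```

The mean stays at p for every n, while the variance shrinks in proportion to 1/n, which is the sense in which the estimate is both unbiased and efficient.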

In view of the fact that $\hat{p}$ is almost always in error to some degree
in spite of the fact that it is the best point estimate possible, there are 
many circumstances in which an interval estimate of p is desirable. 
The interval estimate also is more difficult to compute and to inter- 
pret; hence it will be considered in some detail. 

The situation is this: n members of a certain binomial population 
have been drawn at random so that each member of the population 
had an equal opportunity to be in the sample, and r of them have 
been found to have the specified attribute A. Given the proportion 
r/n observed in the sample, what useful limits can we place on the 
true proportion, p, of A members in the whole population, and what 
confidence can we have in those limits? It is customary to call such 
interval estimates confidence limits, or to say that these limits con- 
stitute a confidence interval. The degree of confidence which we can 
place in such limits on p is measured by the probability that the 
sample has given an interval which actually does include p. As might 
be expected, this probability is the relative frequency with which the 
sampling process used will produce an interval which does include p.
It will be convenient to use the symbol $CI_{95}$, for example, to
designate the confidence interval which has, at the start of the
sampling process, 95 chances out of 100 of including the parameter
which is being estimated.

Suppose that a relatively small manufacturing concern is producing
roller bearings which are to be shipped to a larger company
manufacturing farm machinery. There will be certain specific
standards, such as maximum or minimum limits on diameter, which the
bearings must meet before they are considered to be acceptable
products. Hence, any large batch of bearings could be grouped into
two subgroups marked as “acceptable” and “unacceptable,” respectively,
if every bearing were to have its diameter measured with
perfect accuracy. It will be assumed here for simplicity of discussion
that the company which is to receive the bearings requires that
each shipment must be 90 per cent “acceptable” or it can be rejected.
The concern which is producing the bearings will have to
inspect its products by means of samples because it is inconceivable
that every bearing should be carefully measured.

Assume that a sample of 10 bearings has been inspected and that
all 10 were found to be acceptable. Is this sufficient evidence that
the shipment probably is up to the standard? In this connection,
consider a binomial population with p only .80; that is, it is well
below the standards set above. The probability that every member
of a sample of 10 will be acceptable is $(.80)^{10}$, which is .11; hence
there is about 1 chance in 9 that this definitely substandard batch
of bearings will show none unacceptable on a sample of but 10.
Obviously, if p were less than .80, $p^{10}$ would be less than .11; and,
conversely, if p were larger than .80, $p^{10}$ would be greater than .11.
Therefore, it should be clear that the result, 10 acceptable bearings
out of 10 inspected in a sample, could be obtained from any one of
a whole range of possible binomial populations corresponding to
values of p ranging from 0 to 1. As a matter of fact, the sample
discussed above could be drawn at random from any binomial
population with as many as 10 acceptable bearings among the
individuals. Of course, with n = 10, a sample with r also equal 10 is more
likely to come from a population with p near 1 than from a
population with p near 0.
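A short calculation makes the point concrete. This sketch is illustrative only; the function name `prob_all_acceptable` and the particular list of p values are assumptions made here. It shows how the chance of drawing 10 acceptable bearings out of 10 changes with the true proportion p.

```python
def prob_all_acceptable(p, n=10):
    """Chance that all n sampled bearings are acceptable when the batch proportion is p."""
    return p ** n

# Trial values of p, chosen only for illustration.
for p in (0.50, 0.70, 0.80, 0.90, 0.95):
    print(f"p = {p:.2f}   P(all 10 acceptable) = {prob_all_acceptable(p):.3f}")
```

Even a batch with p = .80 passes this small inspection roughly once in nine trials, which is why the sample result by itself cannot settle the matter.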

The above discussion re-emphasizes the fact that we cannot attain 
certainty in conclusions drawn from samples: there always must be 
some risk that the sample has led to a false conclusion. We choose 
a risk of error which we can afford to take and express it in terms 
of the confidence coefficient described earlier. If it be supposed that 
an event which is as unlikely to occur as 1 time in 20 can be ignored, 
what confidence interval ($CI_{95}$) can we set on p as a result of the
above sample in which r = 10 acceptable bearings out of 10 observed 
in the sample? 

We use what will be called a central 95 per cent of all possible r’s
by determining a range on r which is such that not more than 2½
per cent of all samples with the same n and p will fall beyond each
end (separately) of the range so determined. For example, in the
series below for $(1/4 + 3/4)^{10}$ the first five terms, to the left of
the brace, add to .0197, which is less than 2½ per cent, or .0250. If
the sixth term from the left is added, the sum exceeds .0250.
Therefore, among all possible samples of 10 observations from a binomial
population with p = 3/4, the sample number, r, will be 4 or less in
fewer than 2½ per cent of the samples. At the other end of the
series for $(1/4 + 3/4)^{10}$ no term is less than or equal
to .0250; hence, the “central 95 per cent” will be occupied by samples
for which r = 5, 6, 7, 8, 9, or 10. Consequently, if you have drawn
a sample with n = 10 and p is unknown, it is quite unlikely that p
was as large as 3/4 if it was found in the sample that r = 0, 1, 2, 3,
or 4. As a matter of fact, you could just form the habit of assuming
that p never is as large as 3/4 whenever r is 0, 1, 2, 3, or 4 and you
would be wrong less than 5 per cent of the time because samples of
that sort occur less than 5 per cent of the time when n = 10 and p is
as large as 3/4.
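The trimming rule just described can be written out as a small routine. This is a sketch under the stated rule only (drop terms from each end as long as the dropped total does not exceed .0250); the function names `binomial_series` and `central_region` are chosen here for illustration.

```python
from math import comb

def binomial_series(n, p):
    """Terms of (q + p)^n, indexed by r = 0, 1, ..., n."""
    return [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

def central_region(n, p, tail=0.025):
    """Smallest and largest r left after trimming at most `tail` from each end."""
    terms = binomial_series(n, p)
    low, dropped = 0, 0.0
    while low < n and dropped + terms[low] <= tail:
        dropped += terms[low]
        low += 1
    high, dropped = n, 0.0
    while high > low and dropped + terms[high] <= tail:
        dropped += terms[high]
        high -= 1
    return low, high

print(central_region(10, 3 / 4))   # (5, 10)
print(central_region(10, 2 / 3))   # (4, 9)
```

For p = 3/4 the central region runs from r = 5 to r = 10, as stated above; for p = 2/3 the last term also falls below .0250, so r = 10 is trimmed off as well.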

To answer the question posed earlier, we consider the following 
reasoning. If n = 10, the probability series for p set successively
equal to 2/3, 3/4, .69, and .70 (for reasons which will appear soon) 
are obtained as in Chapter 4 and lead to the following conclusions: 


$(1/3 + 2/3)^{10}$:

  r       0      1      2      3   {    4      5      6      7      8      9   }    10
  term  .0000  .0003  .0031  .0163 {  .0569  .1366  .2276  .2601  .1951  .0867 }  .0173

  (Terms for r = 0 to 3: sum = .0197, which is less than .0250.
  Term for r = 10: .0173, which is less than .0250.)

(Note that the “central 95 per cent” does not include the observed
number of occurrences, r = 10.)


$(1/4 + 3/4)^{10}$:

  r       0      1      2      3      4   {    5      6      7      8      9     10   }
  term  .0000  .0000  .0004  .0031  .0162 {  .0584  .1460  .2503  .2816  .1877  .0563 }

  (Terms for r = 0 to 4: sum = .0197, which is less than .0250.
  None excluded at the upper end.)

(The central 95 per cent does include the sample result, r = 10, but
still might do so with a smaller p; hence p = 3/4 may be too large
to be the lower end of the 95 per cent confidence interval. Therefore,
p = .70 will be tried.)


$(.3 + .7)^{10}$:

  r       0      1      2      3   {    4      5      6      7      8      9     10   }
  term  .0000  .0001  .0014  .0090 {  .0368  .1029  .2001  .2668  .2335  .1211  .0282 }

  (Terms for r = 0 to 3: sum = .0105, which is less than .0250.
  None excluded at the upper end.)

(The central 95 per cent still includes the observed number, r = 10,
and it again is possible that p could be smaller and still keep r = 10
in the central 95 per cent. Hence, try p = .69.)


$(.31 + .69)^{10}$:

  r       0      1      2      3   {    4      5      6      7      8      9   }    10
  term  .0000  .0002  .0018  .0108 {  .0422  .1128  .2093  .2662  .2222  .1100 }  .0245

  (Terms for r = 0 to 3: sum = .0128, which is less than .0250.
  Term for r = 10: .0245, which is less than .0250.)

(The central 95 per cent now just barely excludes the observed
number, r = 10; therefore, the smallest value of p which has been
considered here and which still keeps r = 10 within the central 95 per
cent is .70. However, it is clear that if three decimal places were
used, the lower end of the confidence interval would be nearer to
.69 than to .70; hence, .69 is taken as the lower end of the 95 per
cent confidence interval.)
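The trial-and-error search for the lower limit can be mechanized. The sketch below is an illustration only; the function name `in_central_region` and the step of .01 between trial values are assumptions made here. It tries p = .75, .74, ... in turn until r = 10 drops out of the central 95 per cent, then solves $p^{10} = .0250$ directly to see where the limit falls between .69 and .70.

```python
from math import comb

def in_central_region(n, r, p, tail=0.025):
    """True if the observed r is not trimmed off in either tail of size `tail`."""
    terms = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    lower_tail = sum(terms[: r + 1])   # probability of a result as small as r
    upper_tail = sum(terms[r:])        # probability of a result as large as r
    return lower_tail > tail and upper_tail > tail

n = r = 10
smallest = None
for hundredths in range(75, 59, -1):   # trial values .75, .74, ..., .60
    p = hundredths / 100
    if in_central_region(n, r, p):
        smallest = p
    else:
        break

print(smallest)                 # 0.7
exact = 0.025 ** (1 / n)        # the p for which p**10 is exactly .0250
print(round(exact, 4))          # 0.6915
```

The exact solution lies nearer to .69 than to .70, which agrees with the choice of .69 as the lower end of the interval.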

To determine the upper end of the 95 per cent confidence interval,
it is necessary to find out by a similar procedure how large p can
become and still leave the observation, r = 10, in the central 95 per
cent of the binomial population with n = 10. Obviously, p can go all
the way to 1.00, or 100 per cent, and still not exclude the case when
r = 10; hence, p = 1.00 is the upper limit of the 95 per cent
confidence interval when r has been found to be 10 when n = 10.
Therefore, it is concluded that if with n = 10, r is observed to be 10 also,
the 95 per cent confidence interval on the true percentage in the
population is .69 ≤ p ≤ 1.00. At the same time, the person doing
the sampling is aware that there are 5, or less, chances in 100 that
his sample has been sufficiently “wild,” or unusual, that it has
produced a confidence interval which fails to include the true
proportion, p, of acceptable products in the population which was sampled.

The work done above is illustrative of the principles involved but
is too laborious to be repeated each time a confidence interval is
needed, especially when n > 10. Therefore, advantage is taken of
some work done by C. J. Clopper and E. S. Pearson, published in
Volume 26 of Biometrika. Table 5.21 was obtained by reading from
their graphs the 95 and 99 per cent confidence intervals on p for
n = 50, 100, and 250. If n is smaller than 50, the confidence intervals
are so wide that they are of doubtful value in practice. However,
Table 5.21a has been added to show, through the preceding
discussions, just how the numbers in Table 5.21 could be got. Obviously,
if n were as large as 50 the work illustrated above would become
tremendously laborious.

TABLE 5.21

THE 95 AND 99 PER CENT CONFIDENCE LIMITS ON p FOR SAMPLES OF 50,
100, AND 250 TAKEN FROM BINOMIAL POPULATIONS
(L = lower limit, U = upper limit, both in per cent)
(Based on graphs by C. J. Clopper and E. S. Pearson, Volume 26 of Biometrika.)

              n = 50              n = 100             n = 250
          95%       99%       95%       99%       95%       99%
 r/n     L    U    L    U    L    U    L    U    L    U    L    U
 0       0    8    0   10    0    4    0    5    0    2    0    2
 .025    0   12    0   15    0    8    1    9    1    5    1    6
 .050    1   16    0   19    2   11    1   13    3    9    2   10
 .075    2   19    1   22    3   14    …    …    4   11    4   13
 .100    3   22    2   26    5   18    4   20    6   15    5   16
 .125    5   25    3   29    6   21    5   23    8   17    …    …
 .150    6   28    4   32    8   24    7   26   11   20    9   22
 .175    8   31    6   35   10   26    8   29   13   23   11   25
 .200   10   34    7   38   12   29   10   32   15   26   13   27
 .225   12   37    9   41   15   32   12   35   18   29   15   30
 .250   14   39   11   44   17   35   14   38   20   31   18   33
 .275   16   42   12   47   19   38   16   40   22   34   20   35
 .300   18   45   15   49   21   40   18   42   24   36   22   38
 .325   20   47   17   52   23   42   21   45   27   39   25   40
 .350    …    …    …    …    …    …    …    …   29   41   27   43
  …
  …     55   82   51   85    …    …    …    …   66   78   65   80

In Table 5.21 the observed fraction, r/n, was used instead of r 
because it was convenient to do so. 


TABLE 5.21a

95 PER CENT CONFIDENCE LIMITS WITH n = 10
(L and U in per cent)

   r    L    U        r    L    U        r    L    U
   0    0   31        4   11   75        8   44   97
   1    0   46        5   18   82        9   54  100
   2    2   57        6   25   89       10   69  100
   3    6   66        7   34   94

The use of Table 5.21 will be illustrated by some examples. 


Problem 5.21. Suppose that a random sample of 250 primers for cartridges 
has been taken from a large batch and has been tested by actual firing. If 6 
of the primers fail to fire, place a 99 per cent confidence interval on the true 
percentage of duds in the whole batch, and interpret these limits. 


For this sample r/n = 6/250 = .024, which is so near to the value 
of .025 listed in Table 5.21 that interpolation is unnecessary. There- 
fore, the required confidence interval is read from the table as 1 to 6 
per cent duds in the whole batch. If future action regarding these 
primers is based on the assumption that at least 1 per cent but not 
more than 6 per cent of them are duds, a risk of only 1 in 100 is 
being run that the sample has been misleading. If 6 per cent is more 
than the allowable proportion of duds, this sample indicates that the
batch may be substandard. Whether the primers would be rejected
or additional evidence obtained would depend upon the particular
circumstances.
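The table reading for Problem 5.21 can be compared with limits computed directly from the binomial series. The sketch below is an illustration, not the procedure of the text: it assumes the same equal-tail rule used for n = 10 (half of the risk in each tail) and finds each limit by repeated halving of the interval from 0 to 1. The function names `tail_at_most`, `bisect`, and `confidence_limits` are chosen here.

```python
from math import comb

def tail_at_most(n, r, p):
    """Probability of r or fewer occurrences in a sample of n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

def bisect(func, target, increasing):
    """Find p in (0, 1) with func(p) = target for a monotone func, by halving."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_limits(n, r, coefficient):
    """Equal-tail limits on p for r occurrences observed in a sample of n."""
    tail = (1 - coefficient) / 2
    lower = 0.0 if r == 0 else bisect(
        lambda p: 1 - tail_at_most(n, r - 1, p), tail, True)
    upper = 1.0 if r == n else bisect(
        lambda p: tail_at_most(n, r, p), tail, False)
    return lower, upper

low, high = confidence_limits(250, 6, 0.99)
print(f"{100 * low:.1f} to {100 * high:.1f} per cent duds")
```

To the nearest whole per cent the computed limits should be close to the 1 to 6 per cent read from Table 5.21, bearing in mind that the table entries were taken from graphs.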


Problem 5.22. Suppose that a concern which manufactures roller bearings
must meet a standard of 95 per cent acceptable according to certain prescribed
measurements. If a sample of 250 yields 3 unacceptable bearings, is the
shipment up to the required standard or not?


In this instance r/n = .012, so the 95 and 99 per cent confidence
intervals are found to be 0 to 3, and 0 to 4, respectively, by
interpolation in Table 5.21. Therefore we could conclude that the shipment
has less than 5 per cent unacceptable with considerable confidence
because even the upper limit of the 99 per cent confidence interval
on the true proportion of unacceptable bearings is below 5 per cent.

The procedures demonstrated above are not suggested as sufficient 
quality control measures in themselves, but they do illustrate prin- 
ciples which are basic to acceptance sampling. 


PROBLEMS 


1. Suppose that 100 bolts have been taken at random from a large group 
and that 2 have been found to be defective. What is the 99 per cent confidence 
interval on the true proportion of defectives in the group sampled? 

2. Suppose that a sample of 250 Germans showed that 101 had type O blood. 
Place 95 per cent limits (to nearest per cent) on the percentage of such Ger- 
man persons having type O blood. Ans. 34 to 46. 

3. The little fruit fly, Drosophila melanogaster, has been used so extensively 
in genetic research that a great deal is known about the genes which it carries 
on its chromosomes. Among these genes are some which produce what are 
called recessive lethals because they kill the potential offspring at an early 
stage of development if both chromosomes carry the gene for that particular 
lethal. Mating studies are able to show if only one chromosome of a fly carries 
a particular lethal-producing gene. Suppose that a sample of 250 flies is found 
to include 10 which are carrying one particular lethal. What can you say about 
the true proportion of lethal-carrying flies in this population?

4. Suppose that two different strains of fruit flies have been developed in a 
laboratory upon the basis of the numbers of eggs that the females laid per day. 
Suppose also that a particular recessive lethal, l}, has been discovered in both 
strains; and that samples of 250 flies from each strain gave these results: strain 
A had 18 lethals, strain B had 32 flies carrying lethals among the 250 examined. 
What can you conclude about the true proportion of lethal-carrying flies in 
each strain? Are these two proportions probably equal?

5. Suppose that 50 apples have been selected at random from a tree which 
has a very large number of apples. If 5 apples were found to suffer from a 
certain blight, what percentage of blight do you estimate for the whole tree if
you wish to run a risk of only 1 in 20 that your answer is wrong as a result of
an anomalous sample? Do likewise for a risk of only 1 in 100 being in error.

6. Suppose that 100 eggs are selected at random from a large shipment, and
that 5 are found to be stale. What would you set as the upper limit on the
percentage stale in the whole shipment if you can afford a risk of sampling
error of only 1 in 100? Ans. 13.

7. If a sample of 250 gun barrels in a large shipment has been examined for
defects and none found to be defective, place 99 per cent confidence limits on
the true proportion of defective barrels in the whole shipment. Would a
sample of 250 be large enough if the shipment must contain one per cent, or less,
defective?

8. If 250 one-pound cartons of butter are to be selected from a carload at
random and examined for mold particles, what is the maximum number which
can be found to contain too much mold before you should conclude that 5 per
cent, or more, of the cartons probably contain too many mold particles? Use
a 99 per cent confidence interval as the basis for your answer. Ans. 24.

9. Suppose that 100 cattle selected at random from a very large group have 
been tested for tuberculosis. If 15 were found to be reactors, place an upper
limit on the proportion of reactors in the whole group if 95 per cent confidence 
in the answer is considered adequate in these circumstances. 

10. The United States Department of Agriculture publication, Agricultural
Statistics, 1946, indicates that among United States herds of cattle which are 
infected with Bang's disease at all, an average of 12 per cent of the cows have 
the disease. Suppose that a large herd which has some incidence of the dis- 
ease is to be tested by taking a random sample of 50 cattle. How many out 
of the 50 must be free of the disease before the owner can be assured (at the 
99 per cent level of confidence) that his herd is above average in freedom 
from Bang’s disease? Ans. 46. 

11. Calculate directly from the binomial series $(q + p)^4$ the 75 per cent
confidence interval on p if 3 out of 4 items sampled are found to be acceptable in
the sense employed earlier. Obtain the answer to the nearest whole per cent.

12. Verify the entry in Table 5.21a for r = 5.

13. Suppose that an entomologist wishes to know what percentage of the
corn plants in a large field have been infested to some degree by the southwest
corn borer. He thinks that the percentage is somewhere between 20 and 80,
but he wants to reduce that uncertainty to an interval of not over 15
percentage points. If he is willing to accept a risk of 5 in 100 of drawing an
erroneous conclusion, how large a sample must he take?

14. Suppose that you are helping to administer a farm management
association and wish to learn what percentage of the members use a certain procedure
recommended for poultrymen. Suppose, also, that a random sample of 250
interviews reveals that 200 in the sample do use the recommended practices.
What can you say about the true percentage using this practice in the whole
association? Ans. $CI_{95}$: 74-85%; $CI_{99}$: 72-86% use practices.


5.3 TESTING PREDETERMINED HYPOTHESES 
REGARDING p 


In some fields of investigation the probable magnitude of p can
be deduced from what appear to be reasonable theoretical
considerations, as was illustrated in the discussions of the A-B, and other,
blood groups. As another illustration, the theory of sex inheritance
might lead geneticists to conclude that male and female offspring of
human beings should be produced in equal proportions. If so, p = 1/2
when children are classified merely as male or female. Abundant
statistical evidence now exists to show that more than one-half the
children born in the United States are male; therefore, the original
hypothesis that p = 1/2 is known to be false. However, mankind
cannot afford to wait many years until the collection of a great
volume of data makes it possible to determine, virtually without
error, the truth or falsity of many of the hypotheses which play
important roles in everyday life and in scientific investigations. In
these circumstances samples can be taken and made the basis for
satisfactory conclusions.

The statistical methods needed for a test of a predetermined hy- 
pothesis regarding some binomial population are intended to decide 
whether or not it is reasonable (as defined by an accompanying prob- 
ability statement) to suppose that a given sample actually has been 
drawn from the binomial population which is specified by the hy- 
pothesis being tested. In order that such a decision can be made, a 
basis must be established for comparing a particular sampling result 
with results to be expected from sampling if the hypothesis being 
tested is strictly correct. How should this be accomplished? Actu- 
ally, the problem is a very complex one whose full solution cannot 
be attempted at the reader’s present stage of statistical development; 
but some useful and informative rationalizations can be presented. 

Strange as it may seem, a large part of the complexity of this
problem comes from the fact that there are so many possible
solutions that the more difficult job is to choose the best one. This was
indicated in the introductory part of this chapter. In that
introduction a rather simple example was considered and a hypothesis
was judged for reasonableness by means of the binomial expansion
$(q + p)^n$. It was possible with the aid of that expansion to say
that if p = 1/2 only 1.7 per cent of a large number of random
samples would have r as small as 0, 1, 2, or 3. The rarity of such
occurrences was made the basis for rejecting $H_0(p = 1/2)$. Although the
risk of falsely rejecting $H_0$ when p actually is 1/2 is only .017, the
likelihood of falsely accepting $H_0$ when p actually is 1/3 was seen
to be high; nearly four chances out of five. It was stated in
connection with that example that the choice of the best procedure for
making decisions from samples depends on this latter probability
of an error of the second kind because the probability of an error of
the first kind usually is fixed in advance.

In the example just reviewed, a sampling frequency distribution
was employed, and events which fell in the lower frequency intervals,
that is, the extreme sizes of r, constituted what is called the
region of rejection. On the scale of measurement of r, the points 0,
1, 2, and 3 comprised the region of rejection. The general problem
of choosing best tests of hypotheses regarding population parameters
consists of finding functions of the sample observations and of the
population parameters for which the best regions of rejection can
be defined. The best region of rejection, among several choices, is
the one which for a given $\alpha$ will make $\beta$ the smallest; that
is, for a fixed probability of rejecting a true hypothesis it will give
the lowest probability of accepting a false hypothesis in consideration
of the other possible hypotheses.
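The two risks attached to a region of rejection can be computed from the binomial series. In the sketch below the sample size n = 15 is an assumption: the introductory example is not restated in this section, and n = 15 is used only because it reproduces, to about the accuracy quoted, both the 1.7 per cent and the "nearly four chances out of five" cited above. The function name `prob_in_region` is likewise chosen here.

```python
from math import comb

def prob_in_region(n, p, region):
    """Probability that r falls in the given set of values when the proportion is p."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in region)

n = 15                            # assumed sample size (see the note above)
rejection_region = range(0, 4)    # r = 0, 1, 2, 3

# Risk of rejecting the hypothesis p = 1/2 when it is true.
alpha = prob_in_region(n, 1 / 2, rejection_region)
# Risk of accepting p = 1/2 when p actually is 1/3.
beta = 1 - prob_in_region(n, 1 / 3, rejection_region)

print(round(alpha, 4), round(beta, 3))
```

Trying other regions of rejection with about the same alpha and comparing their beta values is the kind of comparison by which a best region is chosen.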

Statistical research has shown that a good function to use in the
solution of the problem set up for this section is one which is called
chi-square, and is denoted by the symbol $\chi^2$. Its magnitude depends
upon the numbers of individuals, or other units, observed in the
sample to fall into each of the possible classes of attributes. It also
depends upon the numbers which are expected mathematically to
fall in those classes, which in turn depends upon the predetermined
hypothesis regarding the population parameter p. For example,
suppose that we have sufficient reason to believe that one-half the
offspring of guinea pigs should be males. The predetermined hypothesis
now is that p = 1/2. If a sample group of progeny selected at random
from a whole population of actual or possible progeny is found to
have 38 males and 32 females, is it reasonable to believe that this is
a sample from a population for which p = 1/2? The mathematically
expected number of males out of 70 offspring is E(r) = (1/2)(70) =
35; hence the number of males in the sample is 3 greater than
expectation. It follows automatically that there are 3 fewer females
than expected mathematically.


The function $\chi^2$ will be defined by the following formula:

(5.31)  $$\chi^2 = \sum \frac{(\text{observed number in class} - \text{expected number})^2}{\text{expected number in class}},$$

where the summation includes two terms, one for males and one for
females. It is apparent from this formula that if the observed
numbers in the two classes agree well with those numbers which are
expected mathematically considering the assumed magnitude of p,
$\chi^2$ will be relatively small; but if the numbers observed to fall in
each class notably disagree with those expected from the
predetermined hypothesis, $\chi^2$ will be relatively large. The decision that
$\chi^2$ is relatively large or small is based upon the proportion of all
such sample values of $\chi^2$ which would be at least that large if the
hypothesis being tested were, in fact, true.


For the illustration above, $\chi^2 = (38 - 35)^2/35 + (32 - 35)^2/35
= 0.51$. The remaining question is: Is it reasonable to suppose that
$\chi^2$ got so large as this purely as the result of the chance occurrences
of sampling? As was done earlier, attention will be called first to
some actual sampling experiences, and then a mathematical table
will be employed to obtain the required information more quickly
and more accurately. Table 5.31 summarizes the results obtained
from 652 samples from a population for which p was known to be 1/2.
Figure 5.31 is the graph of the r.c.f. distribution presented in Table
5.31.
TABLE 5.31

OBSERVED FREQUENCY AND r.c.f. DISTRIBUTIONS FOR $\chi^2$ WHEN THE TWO
CLASSES, MALE AND FEMALE, DETERMINE THE POPULATION, AND p = 1/2

Class Interval     f    r.c.f.     Class Interval     f    r.c.f.
≥ 5.50             9    1.000      2.00-2.49         50     .885
5.00-5.49          6     .986      1.50-1.99         48     .808
4.50-4.99          4     .977      1.00-1.49         70     .735
4.00-4.49          7     .971      0.50-0.99        101     .627
3.50-3.99         10     .960      0.00-0.49        308     .472
3.00-3.49         12     .945
2.50-2.99         27     .926      Total            652

Figure 5.31. Sampling distribution of $\chi^2$ with one degree of freedom,
as determined from 652 samples taken from a binomial population with
p = 1/2. (Horizontal axis: $\chi^2$, .50 to 5.50; vertical axis: relative
cumulative frequency, .10 to 1.00.)

If we read upward from $\chi^2 = 0.51$ to the graph of Figure 5.31 and
then horizontally to the vertical scale, it appears that about 52 or 53
per cent of all such sampling values of $\chi^2$ would exceed 0.51.
Obviously, then, 0.51 is not an unusual sampling size for $\chi^2$ when
the hypothesis upon the basis of which the expected numbers were
calculated is exactly correct. Hence it is entirely reasonable to
suppose that this sample of male and female guinea pigs deviated from
a 50:50 sex ratio purely as a consequence of the chance variation present
in all sampling. On the other hand, if the sampling $\chi^2$ had been […],
say, we see from Figure 5.31 that sampling variation almost never
produces such a large value of $\chi^2$. We would then conclude that the
sex ratio was not 50:50. It is clear that such conclusions as these
are valid only if a representative sample has been drawn. If just
a few guinea pigs in one particular laboratory have been the basis
for the sample, the conclusions drawn would not apply, without more
sampling evidence, to guinea pigs in general.

As a matter of fact, Figure 5.31 can be used as above for samples
of various sizes just as long as there are only two classes of attributes
involved. Under these circumstances, $\chi^2$ is said to have one degree
of freedom. In this connection it should be noted that when the
expected number has been calculated for one of the two classes, the
other number follows automatically so as to keep the sum of the
expected numbers equal to the sum of the observed numbers.
Likewise, if there are 3 too many males compared to expectation there
must be 3 too few females; that is, there really is but one basic
difference between the observed numbers and the expected numbers.
Basically, that is the reason there is only one degree of freedom for
the $\chi^2$.

Table V makes it possible to determine more easily and accurately
the probability that a sample $\chi^2$ will exceed the observed value when
the $H_0$ is correct. Actually this mathematical distribution is not
exactly right for the $\chi^2$ as defined in this chapter, but the loss of
accuracy is negligible for most sample sizes which would cause
people to have faith in the conclusions drawn therefrom.
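When no table is at hand, the probability can be computed. The sketch below assumes that Table V tabulates the mathematical chi-square distribution; for one degree of freedom the probability of exceeding a value x equals erfc(sqrt(x/2)), a standard identity relating chi-square with one degree of freedom to the normal curve. The function names are chosen here for illustration, and the guinea pig counts are those of the example above.

```python
from math import erfc, sqrt

def chi_square_two_classes(observed, expected):
    """Sum of (observed - expected)^2 / expected over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def prob_exceeding(chi_sq):
    """P(chi-square with 1 degree of freedom exceeds chi_sq)."""
    return erfc(sqrt(chi_sq / 2))

# 38 males and 32 females observed; 35 of each expected when p = 1/2.
chi_sq = chi_square_two_classes([38, 32], [35, 35])
print(round(chi_sq, 2), round(prob_exceeding(chi_sq), 2))
```

A probability of this size, nearly one-half, gives no reason to doubt the hypothesis that p = 1/2.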


Problem 5.31. It was stated in Chapter 3 that the A-B blood groups were
considered to be inherited in a simple Mendelian manner so that AO x AO
should produce offspring three-fourths of whom test to be type A and
one-fourth are type O. Suppose that a sample of 400 children from such
parents shows that 78 per cent are A and 22 per cent are O. Does this
justify rejection of the hypothesis that three-fourths are type A? Or, in
more symbolic terminology, should the hypothesis H₀(p = 3/4 for A) be
rejected?

The expected numbers are 300 A and 100 O; therefore, χ² =
(12)²/300 + (12)²/100 = 1.92, with 1 D/F. By Table V it is found
by interpolation that P = .17. By any usual standards H₀(p = 3/4
for A) is accepted, especially if the hypothesis seems to be well
founded theoretically. In some circumstances we would wish to
place a confidence interval on p. For example, the 95 per cent
confidence interval could be obtained (from a larger table than 5.21 on
page 127). Such an interval would include p = 3/4, because that
hypothesis was accepted far above the 5 per cent level, but would also
include other possible values of p. If this interval included other
defensible hypotheses about p, they also would be acceptable as far
as this sample evidence is concerned. Larger samples then could
be taken with the hope of so narrowing the confidence interval that
only one theoretically defensible hypothesis would be acceptable
upon the basis of the sampling evidence.
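
The arithmetic of Problem 5.31 can be checked with a short computation. The sketch below is an illustration only and is not part of the original text: the function names are hypothetical, and the tail probability relies on the fact that a χ² with 1 D/F is the square of a standard normal deviate, which here stands in for a lookup in Table V.

```python
import math


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def tail_prob_1df(x):
    """P(chi-square >= x) for 1 degree of freedom.

    With 1 D/F the statistic is the square of a standard normal deviate,
    so the upper tail equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


observed = [312, 88]    # 78 and 22 per cent of 400 children
expected = [300, 100]   # 3/4 and 1/4 of 400 under H0

x2 = chi_square(observed, expected)
p_value = tail_prob_1df(x2)
print(f"chi-square = {x2:.2f}, P = {p_value:.2f}")  # chi-square = 1.92, P = 0.17
```

Any two-class problem of this section can be checked the same way by changing the two lists.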


PROBLEMS 


1. According to Table 2.61, 50 female and 59 male guinea pigs were born 
during the period from January to April, inclusive. If these guinea pigs can 
be considered as a random sample of all guinea pigs as regards the sex ratio, 
is the observed difference in numbers of each sex sufficient to cause you to 
reject the hypothesis that the sex ratio actually is 1:1, if you wish to set the 
probability of committing an error of the first kind at .05? 

2. Use the data for May to August, inclusive, to answer the question of
problem 1. Ans. No, P(χ² ≥ 1.12) = .30.

3. Solve as in problem 1 for the data for September to December. 

4. Table 2.62 contains data from those guinea pigs which survived long enough
to produce 4-day gains. Do the data for January to July, inclusive, indicate that
the sex ratio is 1:1 for guinea pigs in that more select population which lives
at least 4 days? Ans. Yes, P > .53.

5. Solve as in problem 4 for the data for August to December.

6. According to genetic theory, if a so-called heterozygous red-eyed fruit fly
is mated with a white-eyed fruit fly, one-half the offspring are expected
mathematically to be white-eyed. The reasoning is analogous to that given earlier
for a mating of O and AB blood types. Suppose that among 500 offspring of
such fruit flies, 240 are white-eyed. Does the χ²-test indicate that such a
sample result would occur rarely (P < .05) while sampling from a binomial
population with p = 1/2, or not? Ans. No, P = .37.

7. If you assume (as is reasonable from Figure 5.31) that χ² must be at least
3.8 in problem 6 before the hypothesis that p = 1/2 should be rejected, how
small can the number of white-eyed flies be among 500 offspring before that
would occur?

8. Suppose that it were agreed that you should not seriously doubt the
hypothesis that p = 1/2 unless χ² exceeds a value χ₀² which is such that
P(χ² ≥ χ₀²) = .01. How small can the number of white-eyed flies among 500
become before you would reject the hypothesis that p = 1/2?
Ans. 221 or 222.

9. Suppose that a sample of 100 college students showed that 40 opposed a
certain proposal regarding student government. Does that result contradict
the hypothesis that 48 per cent of the student body oppose the proposed change?

10. Use a confidence interval approach to answer the question in problem 9,
and discuss the difference between the two methods.
Ans. CI₉₅: 30-50 per cent opposed. CI₉₉: 27-54 per cent opposed.
Both include 48 per cent.

11. Suppose that a poll of Topekans (Kansas) shows that one candidate
received 135 votes to 115 for the other candidate for a certain public office. Use
both the χ²-test and confidence intervals to determine the probable winner,
if the election is to be held very soon so that no appreciable change in opinion
is expected.


5.4 TESTING THE HYPOTHESIS THAT TWO RANDOM 
SAMPLES CAME FROM THE SAME BINOMIAL 
POPULATION 


The type of problem to which the tests described in this section
apply arises when two groups of observations have been taken under
somewhat different circumstances. The question to be answered is:
Did the difference in circumstances produce two distinct binomial
populations as far as can be told from these samples? For example,
consider a simulated test of two house-fly sprays, one made from
lethane, the other from pyrethrum. Suppose that 500 house flies
have been placed in each of two wire cages, identical in all respects.
The lethane spray is applied to one cage, the pyrethrum spray to the
other, with the following results:


Spray Dead Alive Sums 
Lethane 475 25 500 
Pyrethrum 450 50 500 

Totals 925 75 1000 


Actually, the lethane spray killed 95 per cent of the flies in its
cage, whereas the pyrethrum killed only 90 per cent. However, if
both cages had been sprayed with the same spray, different
percentages would have been killed in the two cages in all probability.
How rarely would they have been as different as they were found
to be in this experiment? The χ²-test introduced in section 5.3 can
be employed successfully in the solution of this problem. However,
there is no predetermined hypothesis regarding the magnitude of p
like that available before. Hence some other method must be used
to calculate the expected numbers needed in the χ²-test.


There is no theory regarding insecticides which will furnish an
expected proportion “dead” in the population; but it was observed
that among the 1000 flies sprayed 92.5 per cent were later classified
as dead. If these two sprays are equally toxic to the house flies, they
should tend to kill equally many flies per 500 sprayed. Therefore,
the probability of death can be taken as .925 on the general
hypothesis that the two sprays are equally toxic. This is equivalent to
the hypothesis, H₀(p₁ = p₂). Then the expected number dead out
of 500 in a cage is E(r) = .925(500) = 462.5. That leaves 37.5 as
the expected number of survivors for each spray since 500 flies were
sprayed with each spray. We then can extend formula 5.31 to obtain
the following:

$$\chi^2 = \frac{(475 - 462.5)^2}{462.5} + \frac{(450 - 462.5)^2}{462.5} + \frac{(25 - 37.5)^2}{37.5} + \frac{(50 - 37.5)^2}{37.5} = 9.01.$$


This χ² has only one degree of freedom as before because there is
only one chance difference between the observed and expected
numbers. Note that only one expected number need be calculated before
all the rest follow automatically from the border totals of the table.
Figure 5.31 and Table V clearly indicate that χ² rarely would attain a
size of 9.01, or more, purely from sampling variations; therefore it is
concluded that the lethane spray is superior to the pyrethrum spray,
that is, the hypothesis that p₁ = p₂ is rejected, where p₁ = true
proportion which would be killed by lethane and p₂ is the same for
pyrethrum over many trials.
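
The fly-spray computation can be written out step by step. The following is a minimal sketch, not part of the original text, assuming the pooled proportion is used as the common probability under H₀(p₁ = p₂); the function name is hypothetical, and the tail probability for 1 D/F is computed from the normal distribution instead of being read from Table V.

```python
import math


def pooled_two_sample_chi_square(hits_1, n_1, hits_2, n_2):
    """Chi-square with 1 D/F for H0: p1 = p2, using the pooled proportion."""
    p_hat = (hits_1 + hits_2) / (n_1 + n_2)
    # (observed, expected) for the four cells of the 2 x 2 table
    cells = [
        (hits_1, p_hat * n_1),
        (n_1 - hits_1, (1 - p_hat) * n_1),
        (hits_2, p_hat * n_2),
        (n_2 - hits_2, (1 - p_hat) * n_2),
    ]
    x2 = sum((obs - exp) ** 2 / exp for obs, exp in cells)
    p_value = math.erfc(math.sqrt(x2 / 2.0))  # upper tail, 1 D/F
    return p_hat, x2, p_value


# Lethane: 475 dead of 500; pyrethrum: 450 dead of 500
p_hat, x2, p_value = pooled_two_sample_chi_square(475, 500, 450, 500)
print(f"pooled p = {p_hat:.3f}, chi-square = {x2:.2f}, P = {p_value:.4f}")
```

Only the pooled proportion has to be estimated; every expected number then follows from the border totals, which is why a single degree of freedom remains.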

The technique just described also can be used to decide if two
random samples supposedly drawn from the same binomial population
actually are from a common population. For example, suppose
that two separate random samples were taken on the toxicity of a
lethane spray, with the following results:

            Dead   Alive   Sums
Sample 1     480     20     500
Sample 2     380     20     400
Sums         860     40     900

If p remained constant during this sampling it is best estimated as
p̂ = 860/900 = .956 or 95.6 per cent. In the absence of any logical
predetermined hypothesis, the hypothesis H₀(p₁ = p₂) is tested,
where p₁ = true probability of death during the taking of the first
sample, and similarly for p₂ and the second sample. If the
probability of death for any randomly designated fly stays fixed, p₁ = p₂.
As usual, the expected number killed during the first sampling is
computed to be E(r) = .956(500) = 478, which deviates from the
observed number by only 480 − 478 = +2. It follows that the other
observed numbers also differ from their expected numbers by 2 in
one direction or the other. Hence

$$\chi^2 = \frac{(+2)^2}{478} + \frac{(-2)^2}{22} + \frac{(-2)^2}{382} + \frac{(+2)^2}{18} = 0.422,$$



with 1 D/F. It is learned from Table V that P = .52; hence Ho is 
accepted readily, and it is considered that the two samples were 
taken under conditions which kept the probability of death constant. 
It is not always true that the population can be kept the same under 
repeated sampling; hence it is well to check this matter before dif- 
ferent conditions (such as use of different insecticides) are purposely 
introduced so that their effects can be studied. 


Problem 5.41. Suppose that two sample polls of votes for two candidates
for a public office are taken, one from among residents of cities with at least
25,000 population, the other from among residents not in any incorporated
town or city. If the results were as given below would you accept the statement
that place of residence was unrelated to voting preference in this election? If
so, the two samples are from a common binomial population.


Votes for 
A B Sums 
Rural 620 380 1000 
Urban 550 450 1000 
Sums 1170 830 2000 


Over both the rural and urban samples 58.5 per cent voted for A.
If both samples are from the same binomial population, p̂ = .585 is
the best available estimate of p, the true fraction who favor A. Hence
the hypothesis H₀(pᵣ = pᵤ) will be tested by means of the χ²
distribution. The expected number of rural residents out of 1000 who favor
A is .585(1000) = 585. It deviates from the observed number by
620 − 585 = +35; hence

$$\chi^2 = (35)^2\left[\frac{1}{585} + \frac{1}{415} + \frac{1}{585} + \frac{1}{415}\right] = 10.09, \quad 1\ \text{D/F}.$$

It is apparent from Table V that H₀ should be rejected because
P ≈ .002. It is concluded that pᵣ actually is > pᵤ; that is, the
residents of rural areas favor candidate A more strongly than do the
urban residents because the observed results are very unlikely to
be a sampling accident.
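
When the border totals of a 2 x 2 table are fixed, all four cells deviate from their expected numbers by the same amount d, so the χ² equals d² times the sum of the reciprocals of the four expected numbers. The short sketch below, an illustration added here with hypothetical names, checks that this shortcut agrees with the cell-by-cell sum for the poll of Problem 5.41.

```python
def shortcut_chi_square(observed, expected):
    """d^2 * sum(1/E): valid when every cell deviates by the same amount d."""
    d = abs(observed[0] - expected[0])
    return d ** 2 * sum(1.0 / e for e in expected)


# Cell order: rural A, rural B, urban A, urban B
observed = [620, 380, 550, 450]
expected = [585, 415, 585, 415]   # .585 and .415 of 1000 in each sample

shortcut = shortcut_chi_square(observed, expected)
direct = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"shortcut = {shortcut:.2f}, direct = {direct:.2f}")  # both 10.09
```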

PROBLEMS 


1. Lerner and Taylor, University of California, published the following data 
on chick mortality in the Journal of Agricultural Science, Volume 60: 


Number Progeny Which 


Sire Died Lived 
G14 22 65 
G36 44 35 
G52 17 45 
H8 22 39 


How would you rate these sires as regards low progeny mortality after taking 
account of sampling variability? 
2. Compute χ² for the following practice data and obtain from Table V the
probability that sampling variation alone would produce a χ² at least this
large. Also explain how information of this sort is used to test a hypothesis
about a binomial population.

                     Answered
                  Yes    No    Sum
First Sample      187   113    300
Second Sample     213   187    400
Sum               400   300    700

Ans. P(χ² > 5.16) = .002.


3. Given the following χ²'s, each with one degree of freedom, classify each
as probably due to chance alone, or not, if an event which is as unlikely to
occur as 1 time in 20 is considered to be purely a chance occurrence: 39, 7.1,
0.95, 2.1, 15.2, 8.7, and 12.

4. Within what approximate limits do the lower 75 per cent of all sampling
values of χ² with one degree of freedom lie when the hypothesis being tested
is correct? Ans. 0 to 1.31.

5. For a population of χ²'s each with one degree of freedom, the mean = μ
= 1, and the standard deviation = σ = √2. Approximately what proportion
of the population of χ² with 1 D/F (see Table V) lies in each of the following
ranges: μ ± 1σ, μ ± 2σ, and μ ± 3σ? How do these proportions compare with
the corresponding ones for a normal distribution, as shown in the table for the
normal distribution? What information does this set of comparisons give about
the shape of the chi-square frequency distribution curve when χ² has one degree
of freedom?

6. Samples from records of male and female White Rock chicks up to 8
weeks of age raised at Kansas State College during 1945 gave the following data:


Sex Died Lived
Male 46 227
Female 30 290


Use the χ²-test to decide if there probably is a fundamental difference in chick
mortality due to sex. Ans. P(χ² ≥ 7.35) = .007.

7. The following data are derived from a publication by Atkeson et al. (Jour- 
nal of Economic Entomology, 37:428-35) on the effectiveness of 5 sprays in 
killing flies around dairy barns: 


Number of Flies 


Spray Killed Not Killed 
A 22 84 
B 49 90 
C 89 28
D 39 63 
E 44 24 


Which sprays do you consider as essentially equal in killing power, considering 
sprays which do not differ beyond reasonable sampling variation as being tied? 


5.5 THE χ²-TEST WHEN MORE THAN ONE DEGREE
OF FREEDOM IS REQUIRED


There are many problems in sampling which require the use of
the sampling distribution of a χ² with more than one degree of
freedom, but only a few will be considered in this book. For example,
suppose that both parents have the specific blood types AO and
Rhrh, in the symbols of Chapter 3. Each parent produces four types
of gametes: ARh, Arh, ORh, and Orh, with equal frequencies it is
believed. Hence it can be deduced that such parents will produce
offspring of four blood types: ARh+, ARh−, ORh+, and ORh−
with associated probabilities 9/16, 3/16, 3/16, and 1/16, respectively.
Therefore, if a large number of such parents is obtained for a random
sample we can test the hypothesis suggested by the above argument,
namely, H₀(9 ARh+ : 3 ARh− : 3 ORh+ : 1 ORh−). To illustrate,
suppose that out of 1600 such families in a random sample, the
children were classified as follows with respect to the A-B and Rh
blood groups: 885 ARh+, 310 ARh−, 292 ORh+, and 113 ORh−.
Do these observed numbers deviate enough from the corresponding
theoretical numbers 900, 300, 300, and 100, respectively, to justify
the rejection of H₀? In these circumstances

$$\chi^2 = \frac{(885 - 900)^2}{900} + \frac{(310 - 300)^2}{300} + \frac{(292 - 300)^2}{300} + \frac{(113 - 100)^2}{100} = 2.49,$$

by analogy with previous problems when χ² had only one degree of
freedom. How many degrees of freedom does this χ² have, that is,
how many of the deviations from theoretical expectations can be
considered due to chance? The observed numbers and the expected
numbers each must add to 1600, hence there cannot be more than
3 degrees of freedom. As this is the only such restriction, there are
just 3 degrees of freedom. Naturally a sampling chi-square which
results from 3 chance deviations usually will be larger than one based
on fewer degrees of freedom because more “room” must be left for
sampling fluctuations. The χ² distribution with 3 degrees of freedom
shows that a sample χ² will exceed the observed value, 2.49, nearly
one-half of the time when H₀ is correct; that is, P ≈ .48. Therefore, the
hypothesis H₀(9 ARh+ : 3 ARh− : 3 ORh+ : 1 ORh−) is quite
acceptable in view of this sample evidence.
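
The four-class test above can be reproduced numerically. The sketch below is an added illustration with hypothetical function names; in place of Table V it uses the closed-form upper tail of the χ² distribution with 3 D/F, which is an assumption about how one might compute the probability and not the book's own procedure.

```python
import math


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def tail_prob_3df(x):
    """P(chi-square >= x) for 3 degrees of freedom (closed form)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


ratios = [9, 3, 3, 1]              # ARh+ : ARh- : ORh+ : ORh-
observed = [885, 310, 292, 113]
n = sum(observed)                  # 1600 children
expected = [n * r / sum(ratios) for r in ratios]

x2 = chi_square(observed, expected)
df = len(observed) - 1             # one restriction: the totals must agree
p_value = tail_prob_3df(x2)
print(f"expected = {expected}, chi-square = {x2:.2f}, D/F = {df}, P = {p_value:.2f}")
```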

Another circumstance which produces a χ² with more than one
degree of freedom is encountered when the same hypothesis is tested
more than once by means of successive but independent random
samples believed to have been taken under the same conditions. It
may not be reasonably possible to obtain a convincingly large sample
during any one experiment or survey so that some means of
accumulating statistical evidence from two or more studies is needed.
This problem can be solved with the aid of the following theorem.



Theorem. If two or more sample chi-squares are obtained from 
independent random samples, the sum of these chi-squares fol- 
lows the chi-square distribution for a number of degrees of 
freedom equal to the sum of those for the chi-squares so added. 


Obviously, the process to which the above theorem refers would
make no practical sense unless each χ² were obtained while the same
hypothesis was being tested. It also is important to be assured that
all the samples have been drawn from the same binomial population,
regardless of the truth of the hypothesis H₀, because nothing is
accomplished by such a study if several different populations are
involved. We are trying to test one predetermined hypothesis which
supposedly applies to a fixed set of conditions. To illustrate these
ideas, suppose that some respected group of persons has conjectured
that 60 per cent of the voters in a certain area will vote yes on a
given economic question, and that this conjecture is to be tested by
means of three samples taken in the three districts in the area
involved. It will be assumed that these districts contain equally many
voters.

Before the hypothesis that p = .60 for the whole area is tested,
it is of interest to determine if the three districts are the same
binomial population with respect to yes and no votes on the economic
question which is to be asked the voters. Hence it is supposed that
a poll of 200 randomly chosen voters in each district gave these
results:


Number Voting 


District Yes No Sum 
1 105 95 200 
2 100 100 200 
3 125 75 200 


Sum 330 270 600 


If the whole of each district has the same fraction, p, of yes votes,
the best estimate of p is p̂ = 330/600 = .55 or 55 per cent. If this is
used as the probability that a randomly chosen voter in a given
district will vote yes, the expected number of yes votes in each
district is .55(200) = 110. That leaves 90 as the expected number of
noes; hence

$$\chi^2 = \frac{(105 - 110)^2}{110} + \cdots + \frac{(75 - 90)^2}{90} = 7.07.$$

It is seen that the observed number of yeses in the first district is 5 
below expectation; hence the number of noes is 5 above expectation, 
and only one chance difference between observation and theoretical 
expectation exists. The same can be said for district 2; but since we 
know that the yes vote in district 2 was 10 below expectation and 
that in district 1 is 5 below expectation, it follows that the number 
of yeses from district 3 must have been 15 above expectation. Hence 
only two chance deviations are involved basically, and this χ² has 2
degrees of freedom. 

The specific hypothesis being tested is that the true proportions
of yes votes in the three districts are equal. Table V indicates that
it is rather uncommon during sampling experience for a χ² with 2 D/F
to become as large as the 7.07 observed for these samples if the
hypothesis used is correct. In fact P(χ² ≥ 7.07) = .03. If we have
decided in advance to reject a hypothesis when P < .05, H₀(p₁ =
p₂ = p₃) would be rejected, and we would say that the true fraction
of yes votes is not the same in all three districts. It is clear that after
such a decision it would not be valid to conduct a separate survey
in each district and then combine the evidence from these samples
on the assumption that we have three independent χ²'s testing the
same hypothesis, as is supposed in the theorem stated earlier in this
section.
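
The test that several districts share one proportion can be sketched as follows. This is an added illustration, not the book's own routine: the function name is hypothetical, the pooled fraction is used as the common estimate under H₀, and the tail probability uses the closed form exp(−χ²/2), which holds only for 2 D/F.

```python
import math


def homogeneity_chi_square(yes_counts, sample_sizes):
    """Chi-square for H0: all samples share one binomial proportion."""
    p_hat = sum(yes_counts) / sum(sample_sizes)
    x2 = 0.0
    for yes, n in zip(yes_counts, sample_sizes):
        exp_yes = p_hat * n
        exp_no = (1 - p_hat) * n
        x2 += (yes - exp_yes) ** 2 / exp_yes
        x2 += ((n - yes) - exp_no) ** 2 / exp_no
    df = len(yes_counts) - 1   # deviations must sum to zero over samples
    return p_hat, x2, df


p_hat, x2, df = homogeneity_chi_square([105, 100, 125], [200, 200, 200])
p_value = math.exp(-x2 / 2.0)  # upper tail, valid for 2 D/F only
print(f"p-hat = {p_hat:.2f}, chi-square = {x2:.2f}, D/F = {df}, P = {p_value:.2f}")
# p-hat = 0.55, chi-square = 7.07, D/F = 2, P = 0.03
```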


It appears from the samples given above that p₁ does equal p₂,
but that p₃ is greater than p₁ or p₂. This hypothesis could be tested
by the method just illustrated; but for the purposes of this discussion
districts 1 and 2 will be used to test the original conjecture that
p = .6 for the area covered by districts 1 and 2.

For district 1, the expected number of yes votes is E(r) = .6(200)
= 120 votes. Therefore, χ² = (15)²/120 + (15)²/80 = 4.69 with
1 D/F so that P = .030 by Table V. On the basis of this sample
evidence H₀(p = .6) is rejected at the 3 per cent level.

For district 2, the same expected numbers are used because 200
votes were recorded in this sample also. Therefore, χ² = (20)²/120 +
(20)²/80 = 8.33 with 1 D/F so that P = .003. This time H₀(p = .6)
is rejected more decisively.

By the theorem of this section, χ² = 4.69 + 8.33 = 13.02, with
2 D/F so that P = .002 by Table V. Therefore H₀(p = .6) is
rejected at the 0.2 per cent level upon the basis of the evidence in the
two 200-vote samples.
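
The pooling of evidence by the theorem can be illustrated in a few lines. The sketch below is an addition with hypothetical names: each sample yields a χ² with 1 D/F against the fixed hypothesis p = .6, and the sum is referred to the distribution with 2 D/F, whose upper tail is exp(−χ²/2).

```python
import math


def one_sample_chi_square(yes, n, p0):
    """Chi-square with 1 D/F for H0: p = p0 from one binomial sample."""
    exp_yes = p0 * n
    exp_no = (1 - p0) * n
    return (yes - exp_yes) ** 2 / exp_yes + ((n - yes) - exp_no) ** 2 / exp_no


samples = [(105, 200), (100, 200)]   # (yes votes, sample size), districts 1 and 2
p0 = 0.60

parts = [one_sample_chi_square(yes, n, p0) for yes, n in samples]
total = sum(parts)                   # chi-squares add ...
df = len(parts)                      # ... and so do their degrees of freedom
p_value = math.exp(-total / 2.0)     # upper tail, valid for 2 D/F only
print([round(x, 2) for x in parts], round(total, 2), df)  # [4.69, 8.33] 13.02 2
```

The addition is legitimate only because both samples test the same predetermined p and are independent of one another.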

The chi-square distribution with more than one degree of freedom 
may be useful when the data are classified in a two-way table of r 
rows and c columns. For example, a random sample of Republicans
and Democrats in a certain city might each be grouped on the basis 
of three income brackets as follows: 


                          Annual Income
Party        Under $5000   $5000-$9999   $10,000 and Over   Sums
Republican       200            50               8           258
Democrat         120            20               3           143
Sums             320            70              11           401

This will be described as a 2 by 3 contingency table. Earlier in this
chapter 2 by 2 contingency tables were analyzed by means of the
chi-square distribution.

It is likely that a person might wish to answer the following
question with the aid of the data in the above table. Is the announced
party affiliation of a voter in this city associated with that voter's
economic status? Or, in statistical terminology, the question can be
rephrased as follows: Let p₁ be the proportion of Republicans in this
city with incomes under $5000, p₂ be the same for Republicans with
incomes in the middle income group, p₃ the same for those
Republicans in the highest income bracket; and let pᵢ′ = the corresponding
proportions in the population for the Democrats in this city. The
subscript i takes the values 1, 2, and 3. Then we wish to test the
more complex hypothesis: H₀(pᵢ = pᵢ′, i = 1, 2, 3). As usual, Σpᵢ =
Σpᵢ′ = 1 for i = 1, 2, and 3.

The pᵢ and pᵢ′ are unknown and will be estimated from the sample
observations on the assumption that H₀ is correct. These estimates
of parameters will be obtained as before: p̂₁ = p̂₁′ = 320/401; p̂₂ =
p̂₂′ = 70/401; and p̂₃ = p̂₃′ = 11/401. It follows that the expected
number of Republicans in the lowest income stratum is (320/401)
(258) = 205.9. The other expected numbers are computed in a
similar manner and are shown in parentheses in the following table:


                          Annual Income
Party        Under $5000   $5000-$9999   $10,000 and Over   Sums
Republican   200(205.9)    50(45.0)      8(7.1)             258(258.0)
Democrat     120(114.1)    20(25.0)      3(3.9)             143(143.0)
Sums         320(320.0)    70(70.0)      11(11.0)           401(401.0)

$$\chi^2 = \frac{(200 - 205.9)^2}{205.9} + \frac{(120 - 114.1)^2}{114.1} + \cdots + \frac{(3 - 3.9)^2}{3.9} = 2.35.$$


How many degrees of freedom does this sampling chi-square have?
In the process of estimating the pᵢ and the pᵢ′ the expected numbers
in a column were required to add to the same sum as the observed
numbers for the same columns. This causes the deviations from
expectation in a column to be the negatives of each other. For
example, 200 − 205.9 = −(120 − 114.1). Therefore, both these
deviations of observation from expectation cannot be chance occurrences.
There are, then, at most three chance deviations among the six which
go into the computation of the chi-square. Furthermore, the expected
numbers of Republicans in the three income classes must add to 258,
the total number of Republicans in the sample. A similar statement
holds for the Democrats, but this is not an independent requirement
because the six expected numbers have been forced to total 401 by
making the column totals add to the observed numbers 320, 70, and
11. Hence the number of chance differences between the observed
numbers and those expected mathematically upon the basis of H₀ is
reduced to 2. This, then, is the number of degrees of freedom.

In general, the number of degrees of freedom for a chi-square
calculated for an r x c contingency table is (r − 1)(c − 1). In the
example above, r = 2 and c = 3; hence, (r − 1)(c − 1) = 2.

Having decided that the χ² of 2.35 has 2 D/F, it remains to
determine from Table V if this is an unusual size for a sample chi-square.
Table V shows that P(χ² ≥ 2.35, 2 D/F) = .31, approximately;
therefore it is entirely reasonable that this χ² occurred while sampling
from a population for which the hypothesis, H₀, is true. With this
sampling result at hand, we would accept the proposed hypothesis.
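
The contingency-table procedure generalizes directly to any r x c table. The following sketch is an added illustration with a hypothetical function name; it keeps the expected numbers unrounded, so the statistic may differ in the second decimal from a hand computation that rounds them to one decimal place, and the tail probability shown applies only when there are 2 D/F.

```python
import math


def contingency_chi_square(table):
    """Chi-square and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected = row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            x2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df


party_by_income = [
    [200, 50, 8],   # Republican
    [120, 20, 3],   # Democrat
]
x2, df = contingency_chi_square(party_by_income)
p_value = math.exp(-x2 / 2.0)   # upper tail, valid for 2 D/F only
print(f"chi-square = {x2:.2f}, D/F = {df}, P = {p_value:.2f}")
```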


PROBLEMS 


1. Suppose that of 300 salmon which went up a fish ladder in a certain river 
185 were chinooks, 65 were silver salmon, and 50 were humpbacks. At another 
ladder farther south suppose that the following numbers were recorded: 
chinooks, 150; silvers, 80; and humpbacks, 20. Do these samples (if satisfac- 
torily random, and such is assumed) support the belief that the proportions of 
these three species are the same at the two locations which were sampled? 

2. Referring to problem 1, what matters would cause you to consider them 
as truly random samples? What factors might cause you to think they were 
not? 

3. Suppose that three independent samplings at one fish ladder led to the
following records:

              Number Which Were
                          Not
Sampling      Chinook   Chinook   Sum
First            60        25      85
Second           70        30     100
Third            52        18      70
Sum             182        73     255

Is the hypothesis that the percentage of chinooks stayed the same during the
time of the sampling an acceptable one according to these data?

4. Suppose that some entomologists investigated yellow, short-leaved, and
spruce pines in a certain forest to see how many were being seriously attacked
by insects. Assume the following data were obtained from random samples
of 250 of each species:


Seriously Not 
Species Damaged Damaged Sum 
Yellow 58 192 250 
Short-leaved 80 170 250 
Spruce 78 172 250 


Do the insects studied attack one of these species more than another? Or is
the assumption that the percentage of seriously damaged trees is the same for
all these acceptable? Ans. Chi-square = 11.47, 2 D/F, P = .03.
Reject the assumption.

5. For each species of pine studied in problem 4, test the hypothesis that
one-third of the trees in each species population are seriously damaged. Then
combine these tests by adding the chi-squares, and draw appropriate conclusions.


5.6 CONTROL CHARTS 


Sampling techniques appropriate to binomial populations have 
some important applications in industry in addition to those con- 
sidered previously in this chapter. During a manufacturing process 
designed to produce marketable goods it is important to check fre- 
quently upon the quality of these products. Quality control charts 
provide a simple but effective means for watching both the general 
level of quality and the consistency with which this level is being 
maintained. No attempt will be made herein to discuss all the 
various methods in use because books devoted solely to industrial 
statistics or to quality control are available on this subject.
However, it can be seen rather easily that some of the topics presented
earlier in this book are fundamental to this subject. The subsequent
remarks in this section are intended to point out some of these
fundamentals.

Consider first a manufactured item which could be classified as
either defective or non-defective with respect to predetermined
standards of production. Clearly, a binomial frequency distribution must
be involved with some unknown proportion, p, of defective products
being manufactured. The number of items inspected and classified
as defective or non-defective is the n in the previous discussions of
sampling from binomial populations. As indicated earlier the
standard deviation of a proportion derived from a sample of n observations
is √(p(1 − p)/n). If the manufacturing process is running smoothly
with p = .05, say, and then something occurs to increase the fraction
defective to .15, this occurrence will reveal itself in two ways: (a) the
observed fraction defective will tend to increase rather soon, although
it might not do so for several samples; and (b) the variability among
samples will increase if p changes toward 1/2, and this is the case
cited above. Both these points are illustrated by means of Table
5.61 and Figure 5.61.

TABLE 5.61
SAMPLES WITH n = 50, p STARTING AT .05 AND INCREASING .002 PER SAMPLE
FROM THE TWENTY-SIXTH TO THE SEVENTY-FIFTH SAMPLES, INCLUSIVE, AFTER
WHICH IT REMAINS AT .15

Sample   Fraction    Sample   Fraction    Sample   Fraction
Number   Defective   Number   Defective   Number   Defective
   1       .10         35       .10         68       .28
   2       .06         36       .10         69       .14
   3       .08         37       .02         70       .08
   4       .10         38       .06         71       .1?
   5       .04         39       .10         72        ?
   6       .06         40       .04         73       .14
   7       .06         41       .06         74       .16
   8       .10         42       .14         75       .12
   9       .06         43       .16         76       .22
  10       .04         44       .12         77       .20
  11       .06         45       .06         78       .12
  12       .00         46       .08         79       .14
  13       .06         47       .10         80       .14
  14       .06         48       .06         81       .12
  15       .04         49       .04         82       .14
  16       .04         50       .04         83       .10
  17       .04         51       .08         84       .20
  18       .02         52       .14         85       .12
  19       .02         53       .12         86       .14
  20       .08         54       .04         87       .04
  21       .06         55       .12         88       .16
  22       .10         56       .08         89       .18
  23       .04         57       .16         90        ?
  24       .06         58       .10         91       .14
  25       .04         59       .18         92       .16
  26       .02         60       .14         93       .08
  27       .06         61       .04         94       .18
  28       .04         62       .08         95       .20
  29       .02         63       .10         96       .24
  30       .08         64       .10         97       .16
  31       .04         65       .1?         98       .10
  32       .10         66       .12         99       .06
  33       .00         67       .12        100       .14
  34       .08

Table 5.61 and Figure 5.61 were obtained in the following manner:
(a) p was fixed at .05 by combining 950 green and 50 red beads in
a receptacle, and considering the red beads as defective manufactured
items; (b) 25 samples with n = 50 were drawn and the fraction
defective was plotted over the order number of the sample; (c)
starting with the twenty-sixth sample and continuing through the
seventy-fifth, two green beads in the receptacle were replaced by two red
beads after each successive sample was drawn and recorded; and
(d) starting with the seventy-sixth sample, no additional changes
were made. In brief, p = .05 for the first 25 samples; p increased
.002 per sample (to simulate slippage or excessive wear) for the next
50 samples so that it finally was .15 for the last 25 samples.

Figure 5.61. Control chart for Table 5.61.
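
The bead experiment lends itself to simulation. The sketch below is an added illustration under stated assumptions: the function name and the seed are hypothetical, each sample of 50 is assumed to be drawn without replacement and returned to the receptacle before the next draw, and step (c) is followed literally, so the two-bead exchange happens after each of samples 26 through 75.

```python
import random


def simulate_bead_samples(seed=1, population=1000, n=50):
    """Return 100 simulated fractions defective and the final proportion red."""
    rng = random.Random(seed)
    red = 50                        # p = .05 at the start (50 red of 1000)
    fractions = []
    for sample_number in range(1, 101):
        beads = [1] * red + [0] * (population - red)   # 1 = red = defective
        drawn = rng.sample(beads, n)                   # draw n beads at once
        fractions.append(sum(drawn) / n)
        if 26 <= sample_number <= 75:
            red += 2                # two green beads replaced by two red
    return fractions, red / population


fractions, final_p = simulate_bead_samples(seed=1)
print(f"final p = {final_p}")
print("mean of first 25:", round(sum(fractions[:25]) / 25, 3))
print("mean of last 25: ", round(sum(fractions[75:]) / 25, 3))
```

Different seeds give different tables, much as a second class drawing from the same receptacle would.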

The horizontal lines in Figure 5.61 marked UCL (upper control
limit) were obtained from p + 3√(p(1 − p)/50) with p = .05 until
the seventy-sixth sample, and p = .15 thereafter. Unless something
has occurred unknowingly to change the size of p the fraction
defective rarely will go above the UCL; hence, when the observed fraction
defective frequently exceeds this limit, it is suspected that the
manufacturing process has broken down to some degree. It can be seen in
Figure 5.61 that when the “machine” had “slipped” and p began to
increase, the fraction defective soon started an upward trend.
Shortly, it exceeded the UCL which had been set on the supposition
that p = .05. Then when p ceased to increase and a new UCL was
figured with p = .15, the fraction defective again stayed below the
UCL. Generally, there also is a lower control limit (LCL), but in
this situation it would have been negative and was taken as zero, as
is customary.
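
The two control limits used for the chart can be computed directly from the formula. The sketch below is an added illustration; the function name is hypothetical, and the lower limit is set to zero whenever the formula gives a negative value, following the custom mentioned in the text.

```python
import math


def p_chart_limits(p, n, width=3.0):
    """Lower and upper control limits for a fraction defective."""
    sigma = math.sqrt(p * (1 - p) / n)   # standard deviation of a proportion
    upper = p + width * sigma
    lower = max(0.0, p - width * sigma)  # a negative limit is taken as zero
    return lower, upper


lcl_05, ucl_05 = p_chart_limits(0.05, 50)
lcl_15, ucl_15 = p_chart_limits(0.15, 50)
print(f"p = .05: LCL = {lcl_05:.3f}, UCL = {ucl_05:.3f}")  # LCL = 0.000, UCL = 0.142
print(f"p = .15: LCL = {lcl_15:.3f}, UCL = {ucl_15:.3f}")  # LCL = 0.000, UCL = 0.301
```

With n = 50 the lower limit is zero for both values of p, so only the upper limit can signal trouble on this chart.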

In practice when the percentage of defectives is unknown the
fraction defective observed on at least 25 samples is used in place of p
in the procedure described above. As more samples accumulate a
better estimate of p can be computed and used.
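
As an illustration of this practice, the first 25 fractions of Table 5.61 can be averaged to estimate p and the upper control limit recomputed. The sketch below is an addition to the text; the list is transcribed from the table as read here, on the assumption that the entries for samples 21 to 23 are .06, .10, and .04.

```python
import math

# Fractions defective for samples 1 to 25 of Table 5.61 (n = 50 each)
first_25 = [
    .10, .06, .08, .10, .04, .06, .06, .10, .06, .04,
    .06, .00, .06, .06, .04, .04, .04, .02, .02, .08,
    .06, .10, .04, .06, .04,
]
n = 50

# With equal sample sizes the mean of the fractions is the pooled estimate of p
p_bar = sum(first_25) / len(first_25)
ucl = p_bar + 3 * math.sqrt(p_bar * (1 - p_bar) / n)
print(f"estimated p = {p_bar:.4f}, UCL = {ucl:.3f}")  # estimated p = 0.0568, UCL = 0.155
```

The estimate is close to the true .05 used in drawing the samples, so the limit differs little from the one based on the known p.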

It may be that the quality of a manufactured item is judged by
means of a measurement such as length, diameter, weight, or a volume
which is likely to be a member of a near-normal population rather than
a binomial population. The principles involved are the same but now
x̄ and s_x̄ (see Chapter 6 for definitions) must be used instead of
estimates of p and its standard deviation. In such situations the upper
and lower control limits are given, respectively, by:

$$\bar{x} \pm 3\sqrt{\frac{\sum x^2}{n(n-1)}}.$$

Aside from this change, the control charts are constructed and
interpreted as before. Of course, sample means, x̄ᵢ, are plotted against
order of draw.

In view of the fact that the s_x̄ is somewhat tedious to compute, it has
been found to be both satisfactory and economical to use control limits
which employ the range as the measure of variation instead of the
standard deviation. This procedure and the necessary tables are given
and discussed in publications on quality control or on industrial
statistics, and will not be given here.


PROBLEMS 


1. Another group of samples taken under conditions described for Table 5.61
gave the following results. Make a control chart similar to Figure 5.61 and
draw appropriate conclusions.

Sample   Fraction    Sample   Fraction    Sample   Fraction
Number   Defective   Number   Defective   Number   Defective
   1        ?          17       .04         33       .06
   2       .00         18       .06         34       .08
   3       .06         19       .02         35       .10
   4        ?          20       .08         36       .04
   5        ?          21       .00         37       .04
   6        ?          22       .00         38       .02
   7        ?          23       .06         39       .08
   8        ?          24       .02         40       .08
   9        ?          25       .02         41       .00
  10        ?          26        ?          42       .10
  11        ?          27       .02         43       .10
  12       .02         28       .06         44       .08
  13       .02         29       .00         45       .16
  14       .06         30       .00         46        ?
  15       .00         31       .04         47       .10
  16       .02         32       .06         48       .16
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Sample Fraction Sample Fraction Sample Fraction 
Number Defective Number Defective Number Defective 
49 .12 67 .06 85 .12 
50 -16 68 .16 86 .14 
51 .02 69 .12 87 .12 
52 .12 70 .10 88 +22, 

53 .12 7l .08 89 .08 
54 .14 72 .12 90 .18 
55 .04 73 .18 91 .10 
56 .08 74 .06 92 .12 
57 .14 75 .10 93 .18 
58 .12 76 .24 94 .06 
59 .10 TT .10 95 .10 
60 .08 78 .14 96 .06 
61 .12 79 .18 97 .06 
62 .12 80 .16 98 .10 
63 .06 81 .18 99 .22 
64 .08 82 .18 100 .06 
65 .14 83 .14 
66 .08 84 .08 


2. Calculate an estimate of p from the first 25 samples of problem 1, for the
next 25, and for the last 25 samples. Discuss the effect their differences would
have on the control chart.

3. Use the estimate of p from the first 25 samples of problem 1 to recompute
the UCL for Figure 5.51.

4. Draw 50 successive samples of 50 each from the laboratory population on
fraction defective (furnished by the instructor) and construct a control chart
from those observations.

5. Perform the operations required in problem 4 for a near-normal population
furnished by the instructor.


REVIEW PROBLEMS 


1. Which of the following bridge hands are you the more likely to receive
on one future random deal?

(a) A, K, 10, 4, 3, and 2 of hearts; A, Q, and 10 of diamonds; K, 10, and 3
of spades; and the ace of clubs.
(b) No card larger than a 6.

2. Suppose that a coin is so biased that it turns up heads 3 times for each 2
times that it shows tails, on the average. What is the probability that on 8 flips
there will be fewer heads than tails? Ans. .17.

3. Suppose that you have taken the bid in a bridge game and that you and
the dummy have all the trumps except Q, 8, 5, and 2. What is the probability
that you would get out all the trumps on successive leads of the A and K of
trumps?

4. The prices of barley in the North Central States during 1945 are given
below in cents. (Agricultural Statistics, 1946, USDA.) Compute two different
averages of these prices and discuss their meanings and their limitations.

State:  Ohio   Indiana   Illinois   Michigan   Wisconsin   Iowa   Minnesota
Price:  106    111       111        118        119         103    107

State:  Missouri   N. Dakota   S. Dakota   Nebraska   Kansas
Price:  116        102         103         96         97

5. Determine and interpret the coefficient of variation for the prices of
problem 4.

6. Calculate the mean deviation for the prices of problem 4 and discuss its
meaning. Ans. AD = 6.32.

7. If the egg weights for Rhode Island Red hens are considered to be normally
distributed with $\mu = 60.5$ grams and $\sigma = 4.0$ grams, what range of egg
weights would you expect to include the middle 90 per cent of all weights?

8. The following table (taken from A. S. Weiner, Blood Groups and Trans-
fusions, Thomas, with the consent of the author and the publisher) records the
results of a study of the inheritance of the P factor.

Parents'      Number of      Children's Blood Types
Type          Families       P+       P-      Total
P+ × P+         249          677       79      756
P+ × P-         134          286      179      465
P- × P-          34          (4)*      94       98

* Definite doubt established regarding legitimacy.

Recalling that P+ is genetically PP or Pp, and that P- is only pp, test
statistically the agreement between the above data on children's blood types
and the numbers expected if P+ is assumed to be Pp twice as frequently as it
is PP. Consider the (4)* entry as zero.

Ans. $P(\chi^2 \ge 0.33) > .53$; $P(\chi^2 \ge 5.57) = .018$
on P+ × P+ and P+ × P-, respectively.


9. Make up a set of numbers which has an arithmetic mean of 10 and a
standard deviation of 2.

10. Is there any evidence in the table of problem 8 for or against the
assumption that P+ = Pp twice as frequently as P+ = PP? Explain.

11. Suppose that a large, deep pool in a mountain stream contains a great
many trout of just two kinds, rainbow and brook. You wish to learn what
proportion are rainbows. Two methods of sampling have been suggested thus
far:

(a) Fish the pool until 50 trout are caught, and then use this sample evidence
as the basis for estimating p.
(b) Devise a trap into which the trout will go and be caught, and secure a
sample of 50 this way.

Assuming that (b) can be done, which method of sampling, if either, do you
recommend as statistically best? Why? Can you suggest a better method
than either of these?

12. Referring to the situation of problem 11, suppose that the true proportion
of rainbow trout is .40 and that 8 per cent of the rainbows in this area are
known to be afflicted with a certain disease. What is the probability that a
trout caught at random is a rainbow trout without the disease? Ans. P = .37.

13. Referring again to problem 11, assume that 60 per cent of the trout in
another stream in this same region are brook trout. If on a random sample of
50, 22 are brook trout and the other 28 are rainbows, would you accept or reject
the hypothesis that p = .60 for brook trout in this stream? Explain.

14. Suppose that there are two mountain streams which run quite close
together in a certain area but whose head waters are far apart. You wish to
know if the trout populations of these two streams are the same, in a certain
well-defined area, as regards proportions of the four species: rainbow, cutthroat,
brook, and dolly varden trout. Given the following random sampling data,
what would you conclude?

                      Number of Trout

            Rainbow   Brook   Cutthroat   Dolly Varden   Sum
Stream 1       73       68        49           10         200
Stream 2       70       85        80           15         250
Sum           143      153       129           25         450

Ans. Chi-square = 5.05, 3 D/F, P = .13.

15. Referring to problem 14, compute and interpret the $CI_{95}$ on the true
percentage of brook trout in stream 2.
CHAPTER 6

Introductory Sampling Theory
for a Normal Population
Involving Only One Variable


When the population being sampled has a normal frequency distribution
with unknown parameters $\mu$ and $\sigma$, the problems of estimation
and of testing hypotheses by means of samples are fundamentally
much the same as those considered in Chapter 5 for binomial populations.
Two differences are immediately apparent. (a) There now
are two unknown parameters instead of one, as for the binomial population,
and (b) the measurements, X, have a continuous scale of
measurement and a continuous frequency distribution. These differences
between the normal and the binomial types of populations
will appear in the discussions below as the causes of some changes
in the mechanics of estimation and of testing hypotheses; but the
reader should not lose sight of the fact that the problems and their
solutions are much the same as in Chapter 5.


6.1 OBTAINING THE SAMPLE

The process for obtaining good samples from a normal population
is similar to that discussed in Chapter 5 for randomization and the
avoidance of biases. Here, as there, the population to be sampled
must be clearly defined, and the measurement to be taken on the
units of this population must be stipulated precisely.

After the population is specified and the units (persons, prices,
pigs, plots of land, pots of plants, families, etc.) have been designated
unambiguously, it is necessary to devise a method for obtaining the
particular units which are to constitute the sample. The sampling
situations which come within the scope of this chapter should be
handled by completely randomized samples. To illustrate, suppose
that a person who is interested in the production of raw rubber wishes
to estimate the percentage of rubber in a certain variety of guayule
grown under specified environmental conditions. Suppose also that
he wishes to select and to analyze 25 plants as a basis for this estimate.
The population parameter which is to be estimated is the true
average percentage of rubber in plants of the given variety. Assuming
that there is a large number of guayule plants from which to
select a sample, how should the particular 25 of the sample be chosen?
If the 25 tallest, sturdiest, or most thrifty-looking plants were to be
chosen, they surely would not be representative of the population.
If a person were to stroll about among the available plants and choose
25 in what he considered a random manner, he might unconsciously
bias the sample. A better way to choose the sample is to assign
location numbers (such as row and plant-in-row numbers) to the
plants and then effectively "just draw 25 numbers out of a hat." He
can use tables of random numbers and similar devices if he chooses.
The main point is to see that every plant in the population had at
the start an equal and independent chance to be included in the
sample.
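Drawing location numbers "out of a hat" is easy to imitate with a pseudo-random number generator. The following is a minimal sketch; the field layout of 40 rows with 50 plants per row and the function name are hypothetical.

```python
import random

def choose_plants(n_rows, plants_per_row, sample_size, seed=None):
    """Pick sample_size distinct (row, plant-in-row) locations at random.

    Every location has the same chance of being chosen, which is the
    point of the hat-drawing procedure described in the text.
    """
    rng = random.Random(seed)
    locations = [(row, plant)
                 for row in range(1, n_rows + 1)
                 for plant in range(1, plants_per_row + 1)]
    return rng.sample(locations, sample_size)   # sampling without replacement

chosen = choose_plants(n_rows=40, plants_per_row=50, sample_size=25, seed=1)
for row, plant in sorted(chosen):
    print(f"row {row:2d}, plant {plant:2d}")
```

Fixing the seed only makes the drawing repeatable for checking; in practice it would be left unset.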

If two varieties of guayule were compared for percentage of rub- 
ber, it might be best to start with a suitable area of land staked off 
for tree spacings and then assign the varieties at random to the 
various planting positions. This would make it true that each 
variety initially had an equal chance for any good, or bad, land 
among the possible planting positions. 
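The random assignment of varieties to planting positions can be sketched the same way. The sketch assumes, for illustration only, that the two varieties are to receive equal numbers of positions; the labels "A" and "B" and the function name are hypothetical.

```python
import random

def assign_varieties(n_positions, varieties=("A", "B"), seed=None):
    """Assign varieties to planting positions at random, in equal numbers."""
    if n_positions % len(varieties) != 0:
        raise ValueError("positions must divide evenly among the varieties")
    rng = random.Random(seed)
    labels = list(varieties) * (n_positions // len(varieties))
    rng.shuffle(labels)                         # random order of the labels
    # Position numbers start at 1, as field positions usually do.
    return {position: label for position, label in enumerate(labels, start=1)}

layout = assign_varieties(20, seed=2)
print(layout)
```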

The subject of this section is very broad and complex partly because
there are many different sampling situations and a consequent
need to devise different sampling procedures to fit these different
circumstances. However, as in Chapter 5, only enough is said here to
give the reader some general ideas and, perhaps, induce him to do
more reading on this subject if he is interested. At the least, the
reader can be critical in accepting sampling results presented as
information, advertising, or propaganda.

6.2 THE STATISTICAL DISTRIBUTION OF SAMPLE
MEANS, $\bar{x}_i$, DRAWN FROM A NORMAL POPULATION

Each sample drawn from a normal population of numerical measurements
will nearly always differ from any other sample from the
same population in one or more details. Yet certain features of
samples from a population, as a group, will tend to conform to a
predictable pattern. For example, if 10 observations are to be taken
on a normal population with $\mu = 60$ and $\sigma = 10$, no one can say
what the arithmetic mean of the sample will be; but a good estimate
can be made of its probable size because sample means from such a
population will have a frequency distribution over the long run of
experience. Therefore, it should be expected that probability statements
like those previously discussed herein can be made.

The sample mean, to distinguish it from the unchanging population
mean, will be designated by $\bar{x}_i$, where the subscript refers to the ith
sample.

The frequency distribution of an approximately normal population
with $\mu = 60$ and $\sigma = 10$ is presented in Table 6.21. Six hundred and
forty-eight random samples, each containing 10 measurements, were
drawn from that population. The arithmetic means of these samples
were then computed. The frequency distribution for the 648 sample
means also is given in Table 6.21, with the calculated mean ($\bar{x}$) and
standard deviation ($s_{\bar{x}}$) of the $\bar{x}_i$ being given at the bottom of the
table.
TABLE 6.21

A FREQUENCY DISTRIBUTION TABLE FOR A NEAR-NORMAL POPULATION OF
MEASUREMENTS $X_i$ WITH $\mu = 60$ AND $\sigma = 10$; AND THE FREQUENCY
DISTRIBUTION OF 648 SAMPLE MEANS, $\bar{x}_i$, TAKEN FROM THAT POPULATION
WITH 10 MEASUREMENTS PER SAMPLE

Distribution of Population            Distribution of Sample Means
Class Interval      f    r.c.f.       Class Interval      f    r.c.f.

86.1 -90.0          7    1.00         69.1-71.0           0    1.00
82.1 -86.0         13    1.00-        67.1-69.0           7    1.00
78.1 -82.0         31     .99         65.1-67.0          27     .99
74.1 -78.0         64     .97         63.1-65.0          63     .95
70.1 -74.0        113     .92         61.1-63.0         140     .85
66.1 -70.0        169     .85         59.1-61.0         161     .63
62.1 -66.0        218     .74         57.1-59.0         136     .39
58.1 -62.0        241     .59         55.1-57.0          72     .18
54.1 -58.0        222     .43         53.1-55.0          34     .06
50.1 -54.0        176     .28         51.1-53.0           8     .01
46.1 -50.0        120     .16         49.1-51.0           0     .00
42.1 -46.0         69     .08
38.1 -42.0         34     .04         Total             648
34.1 -38.0         15     .02
30.0*-34.0          8     .01

Total            1500
$\mu = 60$, $\sigma = 10$                  $\bar{x} = 59.98$, $s_{\bar{x}} = 3.14$

* This interval was extended by 0.1 to include the remaining measurement
in the population.

The frequency distribution in the right-hand part of Table 6.21 is an
approximation to an infinite population of $\bar{x}_i$ which would result if this
sampling with n = 10 observations were continued indefinitely. The mean
of any one random sample from the normal population described above
necessarily would be a member of the population of $\bar{x}_i$.

As an illustration of the preceding discussion, suppose that an agri- 
cultural economist is interested in learning if the per-acre income on a 
certain type of farm employing good (recommended by an agricultural 
experiment station, for example) farming practices is greater, on the 
average, than that for farmers not following those practices. He takes 
a sample of n farms on which the recommended practices are employed 
and calculates the mean per-acre income. The same is done for a 
comparable random sample of farms on which these recommendations 
are not followed. Some measurement of the consistency of income on 
each of the two groups of farms also would be needed. If the newer 
practices are worth recommending to replace those currently in use, 
they must produce a new population of per-acre incomes with a larger 
mean, a smaller variance, or both. To obtain information on these 
points, the economist must have adequate information regarding the 
manner in which sample means are distributed; that is, where their 
region of concentration will be, and how they will tend to be dispersed 
about that region of concentration. Hence, the first objective of this 
chapter will be to provide that sort of information about $\bar{x}$'s drawn
from the same normal population of numerical measurements—such 
as per-acre incomes. 

Figure 6.21 presents the graphs of the frequency distributions shown 
in Table 6.21. The larger curve is for the near-normal parent popula- 
tion of X’s, while the smaller curve is taken as a good approximation 
to the distribution of the population of $\bar{x}$'s obtained from samples of
ten observations taken from the population of X's. 

It appears from Figure 6.21 that the two frequency distributions are 
much alike in general form, and seem to be approximately normal 
about the same mean. The major difference lies in the fact that the 
$\bar{x}_i$ exhibit much less variability than the X’s of the population from
which the samples were taken. This is to be expected because one 
important reason for combining a number of individual X’s into one 
sample is to achieve a smoothing out of the individual differences 
among those X's. 

It should be noted from the bottom of Table 6.21 that the mean of
the $\bar{x}_i$ is 59.98, which is quite near to 60, the size of this population
mean, $\mu$. Also, the standard deviation of the 648 sample means is
3.14, which is a bit less than one-third of $\sigma$. As a matter of fact, 3.14
is very nearly equal to $\sigma/\sqrt{n} = 10/\sqrt{10} = 3.16$, to two decimals,
where the symbol n is used to denote the number of observations
taken in the sample.

Figure 6.21. Frequency distribution curve for a normal variate, X, and also
for the sampling mean, $\bar{x}$, for samples with n = 10.

The preceding discussion has suggested three features which are
exhibited by a large number of sample means obtained from a normal
population of numerical measurements. These features are:

(a) Although it is impossible to predict the actual content of a
particular future sample it may be possible to predict the type of
frequency distribution which the sample means will follow, for example,
a normal distribution.

(b) The average sample mean will be of essentially the same magnitude
as the mean of the population sampled.

(c) The sample means, $\bar{x}_i$, will display less variability than the X's
of the population. It is logical that the variability of the sample
means, from sample to sample, should decrease as the size of the
samples increases. It was suggested that a factor $1/\sqrt{n}$ is involved
here.

The following theorem is given without proof because that proof is
inappropriate to this book. The theorem is stated here for the purpose
of replacing the indefiniteness of statements (a), (b), and (c) above
with precise information which can be used in practice.

Theorem. If a population of numerical measurements, X, conforms
to a normal frequency distribution with mean, $\mu$, and standard
deviation, $\sigma$, and if a very large number of random samples of
n observations each is drawn from that population:

(a) The population of $\bar{x}_i$ thus formed will have a normal frequency
distribution.

(b) The mean of the $\bar{x}_i$ will be $\mu$ also.

(c) The standard deviation of the $\bar{x}_i$ will be $\sigma/\sqrt{n}$.
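The theorem can be checked roughly by a sampling experiment like the one behind Table 6.21, with a pseudo-random normal generator standing in for the laboratory population. This is a sketch only; the function name and the seed are arbitrary choices, and the printed figures will differ somewhat from those in the table.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, n_samples, seed=None):
    """Draw n_samples random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(n_samples)]

means = simulate_sample_means(mu=60, sigma=10, n=10, n_samples=648, seed=0)

print("mean of the sample means:     ", round(statistics.fmean(means), 2))
print("std. dev. of the sample means:", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):              ", round(10 / math.sqrt(10), 2))
```

The first figure should fall near 60 and the second near 3.16, in line with parts (b) and (c) of the theorem.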


Note from this theorem that, if the X's are normally distributed, we
automatically know the form of the distribution of the sample means
and hence can write down specifically the formula for their distribution
curve, namely:

(6.21)    $$y = \frac{1}{\sqrt{2\pi}\,(\sigma/\sqrt{n})}\; e^{-\frac{(\bar{x} - \mu)^2}{2\sigma^2/n}}$$

Formula 6.21 can be transformed into the standard normal formula of
Chapter 4 by means of the substitutions
$\lambda = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ for the abscissa and
$y\,\sigma/\sqrt{n}$ for the ordinate,
whereupon Table III can be employed as shown earlier.

If ten measurements are taken per sample and the parent population
has a mean of 60 and a standard deviation of 10, as above, the mean
of the $\bar{x}$'s is also 60, and the standard deviation is
$10/\sqrt{10} = \sqrt{10} = 3.16$. Table 6.21 and Figure 6.21 furnish
approximate verification of these statements from actual experience.

If n = 15 and the samples are drawn from a normal population with
$\mu = 60$ and $\sigma = 10$, the mean of the resulting population of $\bar{x}_i$ also
will be 60, and the standard deviation will be $10/\sqrt{15}$. On 200 such
samples, their mean was 60.22 and their standard deviation was 2.53
instead of the expected 60 and 2.58, respectively. Two hundred is a
relatively small number of samples from which to seek empirical
verification of mathematical theory, but these results do agree quite well
with the theorem given above.


Problem 6.21. If chemical determinations of the percentage of protein in
samples of a certain variety of wheat are known to have a normal frequency
distribution with $\mu = 14$ and $\sigma = 2$, what is the probability that five random
samples will have a mean per cent protein above 16?

In the following discussion, $\sigma_{\bar{x}}$ will be used to denote the standard
deviation of the population of sampling means. In this problem, n = 5,
$\mu = 14$, and $\sigma_{\bar{x}} = 2/\sqrt{5} = 0.90$. Therefore,
$\lambda = (16 - 14)/0.90 = 2.22$; and $P(\lambda > 2.22) = .013$, approximately.
In other words, only about 13 times in 1000 sets of 5 observations like
these would you have the mean per cent of protein at or above 16. The
rarity of such an occurrence might cause you to doubt the accuracy of the
protein analyses and cause you to ask that they be done over.
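The same calculation can be carried out without a printed table of the normal curve. The sketch below uses `NormalDist` from Python's standard library; the function name is hypothetical. Because the standard deviation of the mean is carried unrounded, the result is slightly below the value obtained with 0.90, but it still rounds to about .013.

```python
import math
from statistics import NormalDist

def prob_mean_above(mu, sigma, n, threshold):
    """P(sample mean > threshold) when X is normal with mean mu, s.d. sigma."""
    sigma_xbar = sigma / math.sqrt(n)           # s.d. of the sampling mean
    return 1.0 - NormalDist(mu, sigma_xbar).cdf(threshold)

p = prob_mean_above(mu=14, sigma=2, n=5, threshold=16)
print(f"P(mean > 16) = {p:.4f}")
```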


PROBLEMS 


1. Given that a certain population of measurements is normally distributed 
about a mean of 30 and with a standard deviation of 8. If a sample of 16 
members is to be drawn at random, what is the probability that its mean will 
be below 28? 

2. Under the conditions of problem 1, what is the probability that a sample 
of 9 numbers taken from that population will have an arithmetic mean below 
28? Ans. .23.

3. Solve problem 2 with the standard deviation changed to 12. 

4. If in some particular area the daily wages of coal miners are normally
distributed with $\mu$ = \$15 and $\sigma$ = \$1.50, what is the probability that a
representative sample of 25 miners will have an average daily wage below \$14.25?
Ans. .006.

5. Suppose that a thoroughly tested variety of corn has been found to yield an
average of 35 bushels per acre with a standard deviation of 6, and that these yields
have a normal frequency distribution. If a random sample of 25 yields for a new
variety gives $\bar{x} = 40$, show that there is good reason to believe that the yields of
the new variety are from a population with a mean higher than the 35 bushels per
acre for the population of the older variety.


6.3 ESTIMATION OF THE UNKNOWN MEAN AND
VARIANCE OF A POPULATION FROM THE
INFORMATION CONTAINED IN A SAMPLE

If the parameters $\mu$ and $\sigma$ are unknown for a particular normal
population which is being sampled (and they usually are or there
would be no occasion for sampling) it becomes necessary to estimate
them from the X’s taken in the sample. How should this be accomplished?
Although this really is a mathematical problem whose
solution lies beyond the scope of this book, certain desirable
requirements for sampling estimates of $\mu$ and $\sigma^2$ can be considered.

First, it seems logical that an acceptable estimate should have a
mean equal to the corresponding population parameter after many
samples have been taken. Even though only one sample of n measurements
is to be taken, we usually would like to know that the $\bar{x}$ and $s^2$
we shall obtain as estimates of $\mu$ and $\sigma^2$ are from populations whose
means are $\mu$ and $\sigma^2$, respectively. Sampling estimates which satisfy
this requirement are called unbiased estimates, as noted in Chapter 5.

The second, and more important, requirement which we should
impose on a sampling estimate is that it be as reliable as possible in
the sense that it have a relatively small variance from sample to
sample. For example, suppose two methods of estimating $\mu$ each will
produce an unbiased estimate but, over many samples, one has a
variance of 100 whereas the other has a variance of only 25. The
latter estimate obviously is more consistently near $\mu$ in size, and
hence less allowance need be made for sampling error in this estimate.
This second estimate would be considered a more efficient estimate
than the one whose variance was 100.
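The two requirements can be illustrated by a sampling experiment. The sketch below compares the sample mean with the sample median as estimates of the population mean; the median is brought in here only for illustration and is not one of the estimates discussed in the text. For a normal population both estimates center on the population mean, but the mean varies less from sample to sample.

```python
import random
import statistics

def compare_estimators(mu, sigma, n, n_samples, seed=None):
    """Return {name: (average, variance)} of two estimates over many samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return {
        "mean":   (statistics.fmean(means), statistics.variance(means)),
        "median": (statistics.fmean(medians), statistics.variance(medians)),
    }

result = compare_estimators(mu=60, sigma=10, n=10, n_samples=5000, seed=0)
for name, (average, variance) in result.items():
    print(f"{name:6s}: average = {average:6.2f}, variance = {variance:5.2f}")
```

The estimate with the smaller variance is the more efficient one in the sense used above.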


The estimate of $\mu$, which already has been mentioned, and whose
symbolic definition is

(6.31)    $$\bar{x} = \frac{\sum X}{n},$$

gives an unbiased and highly efficient estimate of $\mu$. It has been
pointed out earlier in a theorem that the variance of $\bar{x}$ under repeated
sampling is only one nth of the population variance, when a normal
population is being sampled. (As a matter of fact, the variance of $\bar{x}$
is $\sigma^2/n$ for any population if $\sigma^2$ is finite.) Hence the $\bar{x}$ is widely used
as an estimate of $\mu$.


The variance $\sigma^2$ will be estimated by means of the formula

(6.32)    $$s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}.$$

This estimate is unbiased and is considered to be about as efficient as
any estimate of $\sigma^2$ as long as the sample is not extremely small. The
usefulness of this estimate in practice will be illustrated repeatedly in
subsequent discussions.

By comparison with the methods used in Chapter 2 to compute $\sigma$
or $\sigma^2$, it is seen in formula 6.32 that two changes have been made.
The $\mu$ is replaced by $\bar{x}$ and the denominator is now (n - 1) instead
of n. Logically, the $\bar{x}$ must be used because $\mu$ is unknown; but it also
must be recognized that the differences, $(X_i - \bar{x})$, are more dependent
upon chance events which occur in the process of sampling than were
the quantities, $(X_i - \mu)$. The $\bar{x}$ itself is subject to sampling error
whereas the $\mu$ is a fixed number for a given population. This matter
is taken into account in sampling theory. One step in this direction
is to associate with each estimated variance a number of degrees of
freedom. The estimate $s^2$ of formula 6.32 is said to be based on n - 1
degrees of freedom because only (n - 1) of the n differences $(X_i - \bar{x})$
are actually chance differences. This follows from the fact that
$\sum (X - \bar{x}) = \sum X - \sum \bar{x} = n\bar{x} - n\bar{x} = 0$. Hence, given any n - 1 of
the deviations of the sample X's from their mean, $\bar{x}$, the other deviation
can be computed without any risk of error. If the true mean, $\mu$,
were known, the n $(X_i - \mu)$'s would all be quantities whose specific
sizes depended on chance, and $\sigma^2$ could be estimated with n degrees
of freedom, which is one more than $s^2$ has. Also, the estimate made
with $\mu$ known would be more reliable than $s^2$, a fact which is associated
with its greater number of degrees of freedom.
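The argument about degrees of freedom can be followed on a small numerical example. The five measurements below are hypothetical; the sketch shows that the deviations from the sample mean add to zero, that the last deviation is therefore fixed by the other n - 1, and that formula 6.32 agrees with the `variance` function of Python's standard library, which also divides by n - 1.

```python
import statistics

sample = [58.0, 63.5, 49.0, 71.5, 60.0]         # hypothetical measurements
n = len(sample)
x_bar = sum(sample) / n
deviations = [x - x_bar for x in sample]

# The deviations from the sample mean always add to zero
# (apart from rounding in the machine arithmetic).
total = sum(deviations)

# Hence the last deviation is determined once the other n - 1 are known.
recovered_last = -sum(deviations[:-1])

# Formula 6.32: sum of squared deviations divided by n - 1.
s_squared = sum(d ** 2 for d in deviations) / (n - 1)

print("sum of deviations:", round(total, 10))
print("last deviation:", round(deviations[-1], 4),
      "recovered:", round(recovered_last, 4))
print("s^2:", round(s_squared, 3),
      "library value:", round(statistics.variance(sample), 3))
```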

As soon as a satisfactory method is available for the estimation of
$\sigma^2$, it follows that the standard deviation of $\bar{x}$, which is $\sigma/\sqrt{n}$ and
is symbolized as $\sigma_{\bar{x}}$, also can be estimated from the following quantity:

(6.33)    $$s_{\bar{x}} = s/\sqrt{n} = \sqrt{\frac{\sum (X - \bar{x})^2}{n(n - 1)}},$$

which is calculated from the observations taken in the sample. It
still is true, as for all sampling estimates, that $s_{\bar{x}}$ is variable from
sample to sample.
Although $\bar{x}$ is the best specific estimate of the population mean, $\mu$,
it is preferable to calculate from the sample an interval in which we
can expect the true mean to lie, with a measurable degree of confidence
in this expectation. The so-called point estimate, $\bar{x}$, is almost
never exactly right, but an interval can be defined in such a way that
we can attach a measure of confidence to the statement that $\mu$ lies in
this interval. This problem can be solved by means of a ratio which is
analogous to the $(X - \mu)/\sigma$ which was studied in Chapter 4. That
ratio involves only one variable, X, and follows the normal distribution.
So also does the ratio $(\bar{x} - \mu)/\sigma_{\bar{x}}$. If the standard deviation,
$\sigma$, is not known, which is the usual sampling situation, the
corresponding ratio

(6.34)    $$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$

involves a variable denominator and is not normally distributed. The
degree of departure from normality depends on the size of the sample,
n, because the denominator is much less variable for the larger samples.

Mathematicians have derived a formula for the sampling distribution
of the ratio, t, for a sample of any size. Although that formula
is not appropriate to this book, sampling experience will provide an
approximation to this distribution, and then mathematical tables will
be provided which give the same information more accurately and
more easily.

Table 6.31 presents the frequency and the r.c.f. distributions of 580
sampling t's obtained from random samples drawn from the near-normal
population of Table 6.21. All samples contained 10
observations. The t's were computed from formula 6.34, using $\mu = 60$
and the $\bar{x}$'s obtained from the samples. For example, if the $\bar{x} = 58.2$
and $s_{\bar{x}}$ is calculated from formula 6.33 to be 2.61, t = (58.2 - 60)/2.61
= -0.69.
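The computation of a single t from a sample follows formulas 6.33 and 6.34 directly. The sketch below first reproduces the worked figure from the summary values given in the text, and then defines a function (a hypothetical name) that does the whole computation from raw observations.

```python
import math

def t_ratio(sample, mu):
    """Compute t = (x_bar - mu) / s_xbar for one sample (formulas 6.33, 6.34)."""
    n = len(sample)
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)  # sum of squared deviations
    s_xbar = math.sqrt(ss / (n * (n - 1)))      # formula 6.33
    return (x_bar - mu) / s_xbar                # formula 6.34

# The worked example in the text, starting from the summary figures.
t_example = (58.2 - 60) / 2.61
print(round(t_example, 2))   # -0.69
```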

TABLE 6.31

OBSERVED FREQUENCY DISTRIBUTION OF 580 $t_i$ OBTAINED FROM SAMPLES OF
10 MEMBERS EACH DRAWN FROM A NORMAL POPULATION WITH $\mu = 60$ AND
$\sigma = 10$

Class Interval        f    r.c.f.      Class Interval        f    r.c.f.
≥ 3.60                1    1.00        -2.80 to -2.01       17     .04
 2.80 to  3.59        4    1.00-       -3.60 to -2.81        3     .01
 2.00 to  2.79       16     .99        < -3.60               1     .00
 1.20 to  1.99       54     .96
 0.40 to  1.19      127     .87        Total               580
-0.40 to  0.39      187     .65
-1.20 to -0.41      115     .33        Arithmetic mean = +0.015
-2.00 to -1.21       55     .13        Standard deviation = 1.10

Figure 6.31 presents the frequency and the relative cumulative frequency
distribution curves corresponding to Table 6.31. The r.c.f.
curve of Figure 6.31 furnishes information concerning the population
of t for n = 10 which is entirely analogous to that to be had from
Table III for normal frequency distributions. Figure 6.31 shows
that: (a) The point where t = 0 on the horizontal axis divides the
population of t's into two equal portions, each containing 0.50, or
50 per cent, of the whole population (as with the normal distribution
at $\lambda = 0$). (b) Approximately 95 per cent (as nearly as can be told
from the graph) of the $t_i$ are less than or equal to +2 in magnitude.
(c) The middle 80 per cent of the t's with n = 10 fall within the
limits -1.5 to +1.5, approximately. Such information will be seen
to be needed in arriving at the interval estimate for $\mu$ described above.

Figure 6.31. Frequency distribution of 580 sample values of t drawn from a
normal population with $\mu = 60$, $\sigma = 10$, and n = 10.
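A sampling experiment of the kind summarized in Table 6.31 can be imitated as follows. This is a sketch under stated assumptions: a pseudo-random normal generator replaces the laboratory population, 5000 samples are drawn instead of 580 so that the relative cumulative frequencies are steadier, and the function names are hypothetical.

```python
import math
import random

def simulate_t(mu, sigma, n, n_samples, seed=None):
    """Return the t ratios from n_samples random samples of size n."""
    rng = random.Random(seed)
    ts = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        ss = sum((x - x_bar) ** 2 for x in sample)
        s_xbar = math.sqrt(ss / (n * (n - 1)))
        ts.append((x_bar - mu) / s_xbar)
    return ts

def rcf(ts, value):
    """Relative cumulative frequency: fraction of the t's at or below value."""
    return sum(1 for t in ts if t <= value) / len(ts)

ts = simulate_t(mu=60, sigma=10, n=10, n_samples=5000, seed=0)
for point in (-1.5, 0.0, 1.5, 2.0):
    print(f"r.c.f. at t = {point:+.1f}: {rcf(ts, point):.3f}")
```

The r.c.f. at t = 0 should come out near .50, and the difference between the values at +1.5 and -1.5 gives the share of the t's in the middle band discussed in (c).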

It should be noted that conclusions (b) and (c) of the preceding
paragraph referred only to samples with n = 10. The general effect
of the magnitude of n on the frequency distribution of the t's is
illustrated in Figure 6.32 for n = 5 and n = 25. The larger the
sample size, the less dispersed are the t’s. In fact, after n becomes
as large as 25 it is difficult to detect much difference between
the r.c.f. curve for t and that for a normally distributed measurement.
Also, if $n_1$ is smaller than $n_2$, the ogive for samples with $n_1$
observations will be above that for samples with $n_2$ observations for
negative t’s and below it for positive t's. This is just a graphic
verification of the fact that the t’s are more dispersed for the
smaller-sized samples.

Figure 6.32. Relative cumulative frequency distributions of t for n = 5
(solid line) and for n = 25 (broken line).

In view of the earlier discussion of degrees of freedom for the
estimate of the standard deviation, the t is said to have the same
number of degrees of freedom as the standard deviation in the
denominator of this ratio. The t's considered so far have one less
degree of freedom than the size of the sample, that is, n - 1.

Table IV provides an r.c.f. distribution of the sampling ratio, t,
for most of the commonly used sizes of samples. This table is in the
form of those r.c.f. distributions discussed in Chapter 2. This form
is different from that found in most statistical tables, but the form
of Table IV fits the purposes of this book better than the traditional
table. However, the values in the more usual table can be derived
from Table IV quite easily. For example, by Table IV the probability
that a random sampling t will have a size below -2 from a
sample of 15 (14 D/F) is seen to be .033. Because the t-distribution
is symmetrical about t = 0 the probability that a t computed with
14 D/F will exceed 2 numerically is twice .033, or .066. This is the
probability given in the usual table for t = 2 and 14 degrees of freedom.
To obtain such a number as .066 for P in those tables we must
interpolate because they give the sampling t's which correspond to
specified values of P.

Table IV will be employed in subsequent discussions instead of the 
r.c.f. curve because it is both more accurate and more convenient to 
do so. However, the reader should remember that the two methods 
are basically the same. The use of tables for the t-distribution is 
especially advantageous because there would have to be a different 
r.c.f. graph for each number of degrees of freedom. 

Suppose, now, that the true population mean, $\mu$, is not known. In
spite of our ignorance of the size of $\mu$ it remains true that sampling
values of t will conform to the t-distribution. For example, for
n = 10 (9 degrees of freedom), it will be true that 92 per cent of the
t's will lie between -2 and +2 (see Table IV). Or, put in terms of
a mathematical inequality, it remains true that the following statement
is correct for 92 per cent of a very large number of samples
with n = 10:

(6.35)    $$-2 \le \frac{\bar{x} - \mu}{s_{\bar{x}}} \le +2.$$

Approximate empirical verification of the truth of this inequality is
found in Table 6.31 above.

In view of the information just given, the following can be said: If
we are about to take a random sample of 10 numerical measurements
from a normal population, the probability is .92 that the t for this
sample will satisfy inequality (6.35) because 92 per cent of all samples
with 9 degrees of freedom do lie within the limits -2 to +2. When
the $\mu$ is not known this statement still is true but we can compute the
t only in terms of the $\mu$. To illustrate, suppose that a random sample
of 10 observations taken from a normal population has given $\bar{x} = 8$
and $s_{\bar{x}} = 2$. Then $t = (8 - \mu)/2$, a function of $\mu$. Before the sample
was taken it could be reasoned that there were 92 chances in 100 that
the t to be obtained would have some size between -2 and +2. Likewise,
after the sample is taken, the assumption that the t does lie
within these limits runs a risk of 8 in 100 of being wrong as a result of
sampling variation.

What does the assumption that the t obtained from the sample
satisfies the inequality (6.35) require of $\mu$ now that $t = (8 - \mu)/2$?
The quantity $(8 - \mu)/2$ must be at least as large as -2 but no
larger than +2; therefore, $(8 - \mu)$ must be at least as large as -4
but not larger than +4. It follows that $\mu$ must be some number
from 4 to 12 unless an 8-in-100 event has occurred. We never actually
know in practice if such a t has been obtained; but we do know that
the odds against it are 92:8.

The probability, .92, associated with the expression (6.35) is called 
the confidence coefficient for the confidence interval 4 to 12 because it 
measures the confidence we can put in the inference that $\mu$ lies within
these limits. This usage is identical with that of Chapter 5. That 
is, a method for basing decisions on sampling evidence has been 
presented; and, although we know it is not infallible, we know what 
risk of error we run when we use the method. 

Obviously, other confidence coefficients besides .92 could be used.
For example, 95 and 99 per cent confidence limits are quite common.
They require the use of the following inequalities for 9 degrees of
freedom:

$-2.26 \le \dfrac{\bar{x} - \mu}{s_{\bar{x}}} \le +2.26$  for 95 per cent confidence limits, $\mathrm{CI}_{95}$;

$-3.25 \le \dfrac{\bar{x} - \mu}{s_{\bar{x}}} \le +3.25$  for 99 per cent confidence limits, $\mathrm{CI}_{99}$.

These two inequalities and that of (6.35) can be put into a more
convenient form simply by multiplying through each (all three members)
by $s_{\bar{x}}$ and then transferring the $\bar{x}$ to the outer members of the
inequalities. The final results for 92, 95, and 99 per cent confidence
intervals are as follows for 9 degrees of freedom:

(6.36)    $(\bar{x} - 2s_{\bar{x}}) \le \mu \le (\bar{x} + 2s_{\bar{x}})$  for a $\mathrm{CI}_{92}$;
(6.37)    $(\bar{x} - 2.26s_{\bar{x}}) \le \mu \le (\bar{x} + 2.26s_{\bar{x}})$  for a $\mathrm{CI}_{95}$;
(6.38)    $(\bar{x} - 3.25s_{\bar{x}}) \le \mu \le (\bar{x} + 3.25s_{\bar{x}})$  for a $\mathrm{CI}_{99}$.

The latter two are used quite commonly for estimates with 9 degrees
of freedom; the first inequality is used here chiefly for convenience of
illustration.
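
The three interval formulas for 9 degrees of freedom differ only in the
t limit used, so they can be sketched with one small function. The
function name and the dictionary of limits below are illustrative; the
numerical limits (2, 2.26, 3.25) and the sample values (mean 8, standard
deviation of the mean 2) are those used in the text above.

```python
def confidence_interval(xbar, se, t_limit):
    """Return (lower, upper) limits on mu: xbar -/+ t_limit * se."""
    return (xbar - t_limit * se, xbar + t_limit * se)

# t limits for 9 degrees of freedom as quoted in the text
T_LIMITS_9_DF = {92: 2.0, 95: 2.26, 99: 3.25}

xbar, se = 8.0, 2.0  # sample of 10 with mean 8 and standard deviation of the mean 2
intervals = {level: confidence_interval(xbar, se, t)
             for level, t in T_LIMITS_9_DF.items()}
for level, (low, high) in intervals.items():
    print(f"CI_{level}: {low:.2f} to {high:.2f}")
```

The 92 per cent interval reproduces the limits 4 to 12 obtained earlier;
the 95 and 99 per cent intervals are wider, as expected.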


As an application of the above-described methods, suppose that you
wish to learn what the average life of a certain type of light bulb is.
Suppose that ten sample bulbs of this type are left burning until all
have burned out, and the time it took each to burn out is recorded. The
following results will be assumed to have been obtained: $\bar{x} = 1400$
hours and $s_{\bar{x}} = 70$ hours. The inequalities above become the
following after simplification: $1260 \le \mu \le 1540$, $1242 \le \mu \le 1558$,
and $1172 \le \mu \le 1628$ hours, respectively, if the computations are
rounded to the nearest whole hour. If you act on the assumption that
the true average life of this type of bulb is between 1260 hours and
1540 hours, you run a risk of 8 in 100 that the sample has misled you.
However, if the widest limits, 1172 to 1628, are used, the risk of an
erroneous assumption is only 1 in 100.


TABLE 6.32

OUTLINE OF SOME SAMPLES FROM A NEAR-NORMAL POPULATION, WITH $\mu = 60$
(Samples were taken by statistics classes from the population of Table 6.21.)

Sample                                 Confidence Limits on $\mu$
Number   $\bar{x}$       s        t         80%          90%          95%
   1     58.4    6.04   -0.86   55.7-61.0    54.8-61.8    54.0-62.7
   2     60.9   12.64    0.24   55.4-66.5    53.6-68.2    51.5-70.4
   4     56.4    5.81   -1.96   53.8-59.0*   53.0-59.8*   52.2-60.6
   5     59.0   11.09   -0.27   54.2-63.9    52.6-65.4    51.1-67.0
   6     53.4   10.33   -2.02   48.9-57.9*   47.4-59.4*   46.0-60.8
   7     52.9    6.13   -3.00   50.2-55.6*   49.3-56.5*   48.5-57.3*
   8     67.7   10.21    2.38   63.2-72.2*   61.8-73.6*   60.4-75.0*
   9     54.0   10.52   -1.80   49.4-58.6*   47.9-60.1    46.5-61.5
  10     54.1    8.75   -2.13   50.3-57.9*   49.0-59.2*   47.8-60.4
 . . .
 302     61.8   12.67    0.45
 303     63.2    7.69    1.32
 304     61.6    8.04    0.63
 305     58.4   10.69   -0.47
 306     64.6   13.39    1.09
 307     64.5    6.33    2.25       *            *
 308     55.9    7.65   -1.69       *
 309     60.0   15.19    0
 310     55.5   12.95   -1.10
 . . .


SUMMARY OF 578 CONFIDENCE INTERVALS

                  Limits Did          Limits Did Not
Confidence        Include $\mu$         Include $\mu$
Coefficient     Number     %        Number     %
    .80           460     79.6        118     20.4
    .90           517     89.4         61     10.6
    .95           559     96.7         19      3.3

Table 6.32 has been included to illustrate further and to clarify
the idea of confidence intervals. It contains some sampling results
obtained from a normal population with $\mu = 60$, and the n = 10. A
summary of 578 samples is shown at the bottom of the table. Not
all the sampling results are given; just enough to satisfy the
purposes of this discussion. The asterisks indicate those intervals
which fail to include $\mu$.

Some of the points which are illustrated by Table 6.32 are the
following:

(a) The confidence coefficients are long-run relative frequencies
which are verified only after a large number of samples. If attention
were confined to samples 4 to 10, the confidence coefficients
would seem to be wrong; but over the set of 578 confidence intervals,
they are verified quite satisfactorily.

(b) The determination of a confidence interval is doubly dependent
on chance: once as regards the mean, and again regarding the
magnitude of the standard deviation. For example, samples 306 and
307 had essentially the same mean but the standard deviations were,
by chance, so different that even the 80 per cent limits from sample
306 included the true mean, 60. Only the 95 per cent confidence
interval from sample 307 includes $\mu$. On the other hand, samples
303 and 308 have practically the same standard deviation, but the
sample means are so different that the 80 per cent limits from sample
308 failed to include the true mean.

(c) The confidence interval is wider for the larger confidence
coefficients, that is, the more certain we choose to be in our
conclusions, the more room we must leave for sampling variations.
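
Point (a) can be tried on a computer in place of classroom sampling. The
sketch below is an assumption-laden stand-in for Table 6.32: it draws
from an exactly normal population with mean 60 and standard deviation 10
(the population of Table 6.21 is only near-normal), uses the standard
tabulated t limits for 9 degrees of freedom, and the function name
`coverage`, the seed, and the number of samples are arbitrary choices.

```python
import math
import random
import statistics

# Two-sided t limits for 9 degrees of freedom (standard tabulated values)
T_LIMITS = {80: 1.383, 90: 1.833, 95: 2.262}

def coverage(mu=60.0, sigma=10.0, n=10, n_samples=10000, seed=1):
    """Fraction of simulated confidence intervals that include mu, per level."""
    rng = random.Random(seed)
    hits = {level: 0 for level in T_LIMITS}
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)  # standard deviation of the mean
        for level, t in T_LIMITS.items():
            if xbar - t * se <= mu <= xbar + t * se:
                hits[level] += 1
    return {level: count / n_samples for level, count in hits.items()}

result = coverage()
for level, frac in result.items():
    print(f"{level} per cent limits included mu in {100 * frac:.1f} per cent of samples")
```

Any short run of samples may look as unfavorable as samples 4 to 10 of
the table; it is only over many samples that the observed fractions
settle near the stated confidence coefficients.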

Problem 6.31. Suppose that a highway commission is studying the
strength of concrete which it wishes to make for highway projects, and that it
concludes that the 7-day tensile strengths of standard samples will be the best
criterion of quality. Suppose also that ten of the standard testing models gave
these results:


$\bar{x} = 439.0$ pounds per square inch, $s = 47.0$ pounds per square inch.


What valid and useful conclusions could they draw concerning the true average
tensile strength of this concrete?


Although the true average strength, $\mu$, is a hypothetical strength
rarely possessed by an actual sample, it does provide a useful
description of the tensile strength of a type of concrete. Before a
confidence interval can be put on $\mu$ a confidence coefficient must be
chosen. Such matters as the seriousness of committing an error, and
the added cost of demanding narrower limits, are involved in this
decision. However, for purposes of illustration it will be assumed
that a risk of 1 in 20 of obtaining a confidence interval not including
$\mu$ is appropriate to these circumstances. Then, using inequality
(6.37) because n = 10 and 95 per cent limits are sought, we obtain
the following:


$439 - 2.26(14.9) \le \mu \le 439 + 2.26(14.9)$


because $s_{\bar{x}} = 47.0/\sqrt{10} = 14.9$ pounds per square inch. When this
inequality is simplified it is found that the 95 per cent confidence
interval is


405 pounds per square inch $\le \mu \le$ 475 pounds per square inch,


to the nearest 5 pounds. Therefore, the true average tensile strength
of this concrete will be considered to be somewhere between 405 and
475 pounds per square inch; but, at the same time, it will be kept
in mind that there is 1 chance in 20 that this sample has been "wild"
and hence has led to an incorrect conclusion.
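
The arithmetic of Problem 6.31 can be retraced step by step. This is a
minimal sketch using only the figures given in the problem and the
95 per cent limit 2.26 for 9 degrees of freedom; the variable names are
illustrative.

```python
import math

n = 10
xbar = 439.0  # mean tensile strength, pounds per square inch
s = 47.0      # sample standard deviation, pounds per square inch
t_95 = 2.26   # 95 per cent limit for 9 degrees of freedom, as in (6.37)

se = s / math.sqrt(n)  # standard deviation of the mean
lower = xbar - t_95 * se
upper = xbar + t_95 * se
print(f"standard deviation of the mean = {se:.1f}")
print(f"95 per cent limits: {lower:.1f} to {upper:.1f}")
```

Before rounding to the nearest 5 pounds the limits are carried to one
decimal place, which shows how much the final rounding changes them.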

If the reader thinks a bit about the material in this section as 
compared to the corresponding section in the preceding chapter on 
binomial populations, it should become apparent that these two sec- 
tions have a great deal in common. In both, a sampling distribution 
was studied, and we were concerned with the relative frequencies 
with which certain sampling phenomena would occur. In particular,
we were interested in the relative frequencies with which intervals
determined from samples would include the unknown population
parameter. This probability is the confidence coefficient.

There also are some differences which could be pointed out. A
major one is that owing to the discontinuity of the binomial frequency
distribution, the confidence coefficient is the lower limit on the
relative frequency with which the confidence interval will include the
parameter. Basically, however, the methods of these two chapters
involve the same kind of statistical inference.

It may have occurred to the reader to wonder why the confidence
interval is taken in the center of the sampling distribution.
Although it is true that 92 per cent of all sampling t's with 9 degrees
of freedom will have sizes between -2 and +2, it is also true that
92 per cent of all sampling t's with 9 degrees of freedom will lie
between -5 and +1.54 (see Table IV). Therefore, the inequality


$-5.0 \le \dfrac{\bar{x} - \mu}{s_{\bar{x}}} \le +1.54$


also will be true for 92 per cent of all samples with 9 degrees of
freedom. Why not use this inequality as the basis for computing the 92
per cent confidence interval instead of the one suggested earlier?
Suppose the inequality above is used on the example used previously
in which $\bar{x} = 8$ and $s_{\bar{x}} = 2$. The 92 per cent confidence interval now
is from 5 to 18 instead of the shorter interval, 4 to 12, obtained
previously. It always will be longer when a non-centrally located
interval on t is used. It should be clear that the shorter the
confidence interval for a given confidence coefficient, the better the
interval estimate. Why be more indefinite than is necessary?
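
The comparison of the centered and the off-center interval can be
sketched as follows. The function name is illustrative; the t limits and
the sample values (mean 8, standard deviation of the mean 2) are those
of the discussion above. Note that solving the inequality for the
population mean reverses the roles of the two t limits.

```python
def interval_from_t_limits(xbar, se, t_low, t_high):
    """Solve t_low <= (xbar - mu)/se <= t_high for mu; returns (lower, upper)."""
    return (xbar - t_high * se, xbar - t_low * se)

xbar, se = 8.0, 2.0
central = interval_from_t_limits(xbar, se, -2.0, 2.0)
off_center = interval_from_t_limits(xbar, se, -5.0, 1.54)

for name, (low, high) in (("central", central), ("off-center", off_center)):
    print(f"{name}: {low:.2f} to {high:.2f}, length {high - low:.2f}")
```

Both intervals carry the same confidence coefficient, but the off-center
one is about five units longer.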


PROBLEMS 


1. Verify the 80, 90, and 95 per cent confidence intervals given in Table 6.32
for samples 1 and 2.

2. Compute 99 per cent confidence limits on $\mu$ for samples 8 and 9 of Table
6.32 and interpret them.
        Ans. $57.2 \le \mu \le 78.2$; $43.2 \le \mu \le 64.8$.

3. Given that $\bar{x} = 35$ and $s = 10$ for a sample of ten observations compute and
interpret the 95 per cent confidence interval on $\mu$. Do the same for n = 15 and
n = 20 and compare them. What is the implication regarding the relation
between the size of the sample and the width of the confidence interval,
everything else being equal?

4. Given that the t was computed to be -2.08 for sample number 525, Table
6.32, determine whether or not the 90 per cent confidence interval includes $\mu$.
Do likewise for 95 per cent limits.
        Ans. $\mathrm{CI}_{90}$ does not include $\mu = 60$; $\mathrm{CI}_{95}$ does.

5. Use Figure 6.31 to determine the 86 per cent confidence interval on $\mu$
from sample 6 of Table 6.32.

6. Suppose that an improved method of cultivating wheat has produced an
average of 5 bushels per acre more yield than an older method on a sample of
21 plots. Also assume that the standard deviation on this sample is s = 5
bushels per acre. What are the 95 per cent confidence limits on the true
average additional yield produced by the new method? Suppose that the new
method costs \$5 per acre more to use than the older method. What can you
say about the probable economic advantage obtained from the new method if
wheat is currently bringing \$2.25 per bushel?
        Ans. $\mathrm{CI}_{95}$: $2.65 \le \mu \le 7.35$ bushels. Gain = 96 cents per acre.

7. Suppose that in problem 6, s had been 10 bushels per acre. Show how this
increase in sampling variability among the 21 plots changes the answers to the
questions asked in problem 6.

8. Suppose that in problem 6, the sample had involved but 10 plots. Show
how this decrease in the size of the sample changes the answers to the questions
asked in problem 6.

9. Suppose that chemical analysis shows that the mean per cent protein for 16
wheat samples is 14.28 and that the estimated standard deviation for the
population of $\bar{x}$'s being sampled is $s_{\bar{x}} = 2.00$. What conclusions can you draw
from the 99 per cent confidence interval on the true mean $\mu$?

10. If basal metabolisms determined for a random sample of 25 sixteen-year-old
Kansas girls produced $\bar{x} = 45.80$ calories per square meter per hour, with
$s = 0.50$, what are the 80 per cent confidence limits, and what information do
they provide in setting up a standard for sixteen-year-old Kansas girls?
        Ans. $\mathrm{CI}_{80}$: $45.67 \le \mu \le 45.93$ calories per square meter per hour.

11. Suppose that during a recent period of strong prices twenty-five 450-pound
choice steer calves were purchased, October 15th, wintered on silage and one pound
of cottonseed meal per day, and then sold on April 15th as choice stocker steers.
If the average net income per animal was $\bar{x} = \$25$ with $s = \$10$, place a
90 per cent confidence interval on the true average net income per animal for the
population so sampled. A similar sampling of choice 600-pound yearling steers
produced a 90 per cent confidence interval of \$105 to \$130 net income per steer.
What conclusions can you draw regarding the most profitable choice for a
cattleman to make between these two systems?


6.4 A STATISTICAL TEST OF A HYPOTHESIS 
THAT A GIVEN SAMPLE CAME FROM A 
NORMAL POPULATION WITH A 
SPECIFIED MEAN 


The general problem of deciding whether or not a particular sample
came from a normal population whose mean, $\mu$, is specified but
whose standard deviation can be estimated only by means of s has
received consideration earlier in this chapter. In practice, the
specification of $\mu$ is based upon a hypothesis about the population
under study. For example, if a new method of cultivating wheat does not
produce higher average yields the population of differences in yield
between the new and old methods grown in a series of paired plots
of land will have a true mean $\mu = 0$ because, on the average, there
is no advantage to the new method. If the hypothesis that $\mu = 0$
is found from statistical analysis to be unreasonable in view of the
sampling results, that hypothesis should be rejected. However, if
the evidence in the sample is in reasonable accord with that
hypothesis, it should be accepted. This is the idea behind the methods
to be presented in this section, and also of all so-called tests of
significance.

A generally satisfactory solution to the problem of this section can
be obtained from the t-distribution when normal or near-normal
populations are being sampled. As the reader already knows,
$t = (\bar{x} - \mu)/s_{\bar{x}}$ and has n - 1 degrees of freedom if the sample
contains n observations. When the $\mu$ is specified by the hypothesis to
be tested, the t can be calculated. Thereafter, we can determine from
Table IV how uncommon such a t is when the samples are drawn from the
supposed population. For example, if n = 14 and t turns out to be 0.90
on the assumption that $\mu = 0$, we learn from Table IV that about 38 per
cent of all sampling t's with 13 degrees of freedom are numerically
larger than 0.90. Therefore, this value of t is not at all unusual and
hence we would have no reason to doubt the hypothesis that $\mu = 0$. But,
suppose that t had been 3.0. It is seen in Table IV that only about
one t in 100 from samples of this size ever gets as large as 3,
numerically. Hence, we might reasonably doubt that $\mu$ really is zero
because t rarely attains such a size when the hypothesis being tested
is true.

Figure 6.41. Illustration of the effects on the t-distribution of a false
hypothesis regarding $\mu$.

To illustrate the above discussion graphically, suppose that the true
mean, $\mu$, of a normal population of measurements actually is 2 but
owing to some error in reasoning $\mu$ is considered to be 0. What effect
does this have on the frequency distribution of t? For this situation,
t really is $(\bar{x} - 2)/s_{\bar{x}}$ but because of the error regarding $\mu$ the
values of t are calculated from the formula $t = \bar{x}/s_{\bar{x}}$. In view of the
fact that $t_1 = (\bar{x} - 0)/s_{\bar{x}}$ is just $2/s_{\bar{x}}$ units larger than
$t_2 = (\bar{x} - 2)/s_{\bar{x}}$, we are actually sampling population B of Figure
6.41, but we think that we are sampling from population A. The
discrepancy should, and would, show up through an excess of large t's
beyond the proportions predictable from Table IV. To be more specific,
suppose that attention is centered upon a particular t such that p per
cent of the t-distribution lies to the right of this point. Such a
point is indicated in Figure 6.41 as $t_p$. It is noted from this figure
that a much larger fraction of the true t's (figure B) lie to the right
of $t_p$ than is true for the population resulting from the calculations
with the false value for $\mu$. This discrepancy between the hypothetical
and the actual situation will show up in the sampling. Obviously, the
greater the discrepancy the more easily it is detected by sampling.
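
The excess of large t's under a false hypothesis can be seen in a small
sampling experiment. The sketch below keeps the true mean 2 and the
mistaken value 0 from the discussion above, but the standard deviation
(4), the sample size (10), the seed, and the function name are
assumptions made only for illustration; 2.262 is the standard tabulated
value beyond which 5 per cent of t's with 9 degrees of freedom fall,
numerically.

```python
import math
import random
import statistics

def fraction_beyond(true_mu, assumed_mu, sigma=4.0, n=10, t_limit=2.262,
                    n_samples=10000, seed=2):
    """Fraction of samples whose t, computed with assumed_mu, exceeds t_limit numerically."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        se = statistics.stdev(sample) / math.sqrt(n)
        t = (statistics.fmean(sample) - assumed_mu) / se
        if abs(t) > t_limit:
            count += 1
    return count / n_samples

correct = fraction_beyond(true_mu=2.0, assumed_mu=2.0)   # t's follow Table IV
mistaken = fraction_beyond(true_mu=2.0, assumed_mu=0.0)  # t's shifted by the false mean
print(f"hypothesis true:  {100 * correct:.1f} per cent of t's beyond the limit")
print(f"hypothesis false: {100 * mistaken:.1f} per cent of t's beyond the limit")
```

Increasing the gap between the true and the assumed mean, or the sample
size, raises the second percentage, which is the sense in which a
greater discrepancy is more easily detected.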

In practice it is not feasible, efficient, or economical to continue to 
draw samples from a population until the evidence for or against a 
certain hypothesis is so overwhelming that there is virtually no doubt 
of its truth or of its falsity. Instead, it is common to take what is 
considered an adequate number of observations on the population, 
choose the risk we shall take of rejecting a true hypothesis, and then 
reject the hypothesis being tested if t goes beyond that predetermined 
limit. To illustrate, suppose that a sample of 15 observations is to 
be taken under conditions which specify the population being 
sampled, and that it is decided that it is appropriate to take 1 
chance in 20 of rejecting a true hypothesis. For 14 degrees of
freedom, a t which is at, or above, 2.15 numerically (see Table IV) will
occur about 1 time in 20 when the choice of $\mu$ is entirely correct. If
we decide to regard all t's which are outside the interval
$-2.15 < t < +2.15$ as being the result of a false hypothesis regarding
$\mu$, we run a risk of 1 in 20 of rejecting a true hypothesis as a result
of sampling variations.


Problem 6.41. Suppose that some educators test two proposed teaching
procedures in the following way:

(1) All available records and the opinions of teachers are applied to the
selection of 20 students who, as a group, do a good job of representing students
who will be studying the materials upon which the test is to be based.

(2) Two equally difficult sections of subject matter are carefully chosen.

(3) The group of 20 students is taught one section by method A, the other
by method B.

(4) Two equally difficult examinations, one on each section of the subject
matter, are formulated by competent teachers and given to the 20 students.

(5) The average difference, student-by-student, between the two test scores
is to be used as the measure of the difference in efficiency between the two
methods of instruction.




It will be assumed herein that the following test scores were made under the
two methods:


Student    Grade on Method        X
Number        A        B       (A - B)

   1         90       85          5
   2         72       73         -1
   3         86       80          6
   4         78       75          3
   5         97       95          2
   6         85       81          4
   7         64       50         14
   8         69       65          4
   9         76       70          6
  10         79       70          9
  11         81       78          3
  12         83       83          0
  13         75       71          4
  14         85       80          5
  15         72       69          3
  16        100       90         10
  17         88       82          6
  18         77       65         12
  19         80       70         10
  20         73       65          8

What conclusions can we draw validly from these results? 

In the usual manner it is found that $\bar{x} = 5.65$ in favor of A, and
that $s_{\bar{x}} = 0.87$. Therefore, $t = (5.65 - \mu)/0.87$, with 19 degrees of
freedom. What is a reasonable hypothesis regarding the magnitude
of $\mu$ in the population of $X_i = A_i - B_i$ assumed to follow a normal
distribution? The purpose of this study was to determine if one
method of instruction is better than the other, and, perhaps, assess
the magnitude of the difference if one exists. If one method is superior
to the other, $\mu$ is not equal to zero; however, there appears to be no
logical way to decide ahead of the test just what the size of $\mu$ might be.
The problem therefore is attacked by assuming that $\mu = 0$ and then
determining statistically just how satisfactory such a hypothesis is.

If $\mu = 0$, $t = (5.65 - 0)/0.87 = 6.49$, with 19 degrees of freedom.
It is clear from Table IV that a t of this size is an extremely rare
occurrence, a fact which leads us to reject decisively the hypothesis
that $\mu = 0$. In other words, method A most certainly is somewhat better
than method B. If there is any benefit to be derived from an estimate
of the magnitude of $\mu$, now that the hypothesis $H_0(\mu = 0)$ is rejected,
confidence intervals can be obtained at any appropriate level of
confidence.
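
The quantities used in Problem 6.41 can be recomputed directly from the
column of differences X = A - B. This is a minimal sketch; the variable
names are illustrative, and the hypothesis tested is that the true mean
difference is zero.

```python
import math
import statistics

# Differences X = A - B for the 20 students of Problem 6.41
diffs = [5, -1, 6, 3, 2, 4, 14, 4, 6, 9, 3, 0, 4, 5, 3, 10, 6, 12, 10, 8]

n = len(diffs)
xbar = statistics.mean(diffs)  # mean difference
s = statistics.stdev(diffs)    # standard deviation of the differences
se = s / math.sqrt(n)          # standard deviation of the mean difference
t = (xbar - 0) / se            # t under the hypothesis that the true mean is 0
print(f"mean = {xbar:.2f}, standard deviation of the mean = {se:.2f}, "
      f"t = {t:.2f}, D/F = {n - 1}")
```

Carried without intermediate rounding, t comes out slightly above the
6.49 of the text, which uses the rounded value 0.87 in the denominator;
the conclusion is unaffected.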


PROBLEMS 


1. Suppose that 20 pairs of college students have been so selected that the
members of each pair can be considered equal in intelligence, scholastic records,
and in other factors associated with the making of good scholastic records in
a college. Suppose one member of each pair is enrolled in a class in social
science which is to be taught by a discussion method emphasizing analysis of
problems and reflective thinking; whereas the other member of each pair is put
in a class taught by a more formal lecture-recitation procedure. The subject to
be studied is the same in each class, and the teachers are considered to be equal
in teaching abilities. At the end of the teaching period all 40 students are given
the same examination with the following results:


             Grade                            Grade
Pair   Discussion   Lecture      Pair   Discussion   Lecture
  1       120         110         11       115         108
  2        79          75         12       103          91
  3        65          70         13        75          70
  4        67          75         14        92          95
  5        80          75         15       105         102
  6        85          80         16        82          78
  7        98          90         17        78          76
  8       110          95         18        87          90
  9       108          92         19       131         120
 10        86          80         20        50          51


The discussion method appears to be the better for producing good test
scores, but there is considerable variation. Test the hypothesis that the two
methods actually are equal on the average (that is, $\mu = 0$) and draw
appropriate conclusions.

2. A study was made to determine if tomatoes high up on a plant have more
ascorbic acid (vitamin C) than those lower down on the same staked plant. To
study this matter, 10 pairs of red-ripe tomatoes were taken from 10 plants, with
one member of each pair being from the fifth cluster and the other from the
seventh cluster from the bottom of each plant. Each of the tomatoes from the
seventh cluster had more vitamin C than the corresponding tomato from the
fifth cluster by the following respective amounts:


X: 6.6, 11.6, 10.9, 7.4, 8.8, 10.3, 7.4, 7.8, 5.8, and 4.0 milligrams/100 grams.


Given that $\sum X = 80.6$ and $\sum X^2 = 700.46$, compute a 90 per cent confidence
interval on the true average amount by which the ascorbic acid in tomatoes at
the seventh cluster exceeds that at the fifth cluster on the same plant, and draw
appropriate conclusions regarding $\mu$.
        Ans. $\mathrm{CI}_{90}$: $6.7 \le \mu \le 9.4$ milligrams/100 grams.

3. Solve as in problem 2, using the following data from the sixth and eighth
clusters of 10 plants:


X: 7.0, 13.3, 8.6, 6.4, 8.3, 9.9, 2.6, 9.1, 6.6, and 1.6.


You are given that $\sum X = 73.4$ and $\sum X^2 = 643.40$.

4. During the winter tomatoes often are shipped green and allowed to ripen
in the package. Aside from matters of flavor and appearance, it is of interest
to know what effect this practice has on the vitamin C concentration in the
fruit. Two tomatoes were picked from each of 18 plants and at the same
cluster on the plant. One was red-ripe, the other was green (no red or yellow
coloring). The red-ripe member of each pair was analyzed immediately for
vitamin C; the other was ripened at room temperature out of the sun until
red-ripe before its vitamin content was determined. Then the differences in
vitamin C between members of pairs were determined with the following results:


$\sum X = 49.37$ milligrams/100 grams, favoring vine-ripened tomatoes, and
$\sum X^2 = 387.5911$.


Determine statistically if there probably is a loss in ascorbic acid which is due
to picking tomatoes green and letting them ripen on the way to market or on
the shelf.

5. Suppose that a sociologist has conjectured that the average rent for two-room
furnished apartments in a certain section of a city is \$90 per month. A sample
of 20 apartments had $\bar{x} = \$82.50$, with $s = \$8$. Use the t-test to determine if the
hypothesis $H_0(\mu = 90)$ is acceptable when sampling variance is taken into account.

6. Suppose that a timber cruiser has judged that the average breast-high
diameter of a certain stand of timber is 2 feet. Is the timber cruiser's estimate
reasonable if 31 trees are selected at random with these results: $\bar{x} = 2.3$ feet
and $s = 0.8$ feet?

7. Suppose that a store conducts a study of the comparative net profits from
roasting ears sold in cellophane packs as compared to the loose ears in the husks.
The experiment is conducted for the 26 business days of a month. At the end of
each day, the net profit per ear is figured for each way of selling the corn. The
average advantage of using the cellophane pack on these 26 daily comparisons was
2 cents, with a standard deviation s = 0.5 cent. When sampling variation is
considered, was the average advantage of the cellophane pack enough to justify the
conclusion that it is really more profitable?

8. If, in a certain investigation, $\bar{x} = 10.5$ and $s = 3$, how large must n be to
cause the rejection of the hypothesis $H_0(\mu = 0)$ at the one per cent level?

9. If in a sampling study to which the t-test is appropriate the $\bar{x}$ is 28 and
$s = 5$, how large must n be before the hypothesis $H_0(\mu = 0)$ will be rejected at
the 5 per cent level?

10. Suppose that 27 pairs of plants of a certain species have been selected for
close similarity and are planted in pairs as close together as is appropriate for
this species. One member of each pair has had some boron added to the
fertilizer; otherwise the plants are treated identically. If the 30 plants having
boron outgrew their partners by an average of 3.6 centimeters, with the standard
deviation of the difference being s = 12 centimeters, is this sufficient evidence
for the statement that the addition of the boron produces some additional
growth?


6.5 A STATISTICAL TEST OF THE HYPOTHESIS THAT
TWO SAMPLES OF OBSERVATIONS HAVE BEEN
DRAWN FROM THE SAME NORMAL
POPULATION OF NUMERICAL
MEASUREMENTS


If two samples have been taken under the same conditions but 
with some one important feature changed, we usually wish to learn 
if this change has produced a new population of measurements. For 
example, if two groups of Duroc-Jersey pigs have been fed two dif- 
ferent rations, the experimenter wants to know if the difference in 
ration has produced an important difference in average daily gains. 
That is, has the difference in ration created different populations of 
average daily gains? Fundamentally, the method to be employed 
in the solution of this problem is the same as that described in the 
preceding section, but the mechanics of the procedure need to be 
altered to fit the new sampling situation. 


The following symbols will be employed: 


$d_i = \bar{x}_{1i} - \bar{x}_{2i}$ = the difference between the ith sample mean from
      group 1 and the corresponding sample mean from group 2, and

$s_d$ = the standard deviation of the $d_i$.

Before the general method for attacking the problem just posed is
described, some actual sampling experiences will be presented in
tabular form, and discussed. Table 6.51 shows a summary of 403 $d_i$
obtained from pairs of samples, each with n = 10, drawn from the
near-normal population of Table 6.21. It is recalled that the standard
deviation of that population is $\sigma = 10$.


TABLE 6.51


FREQUENCY AND r.c.f. DISTRIBUTIONS FOR 403 SAMPLE VALUES OF $d_i$ WITH
n = 10 DRAWN FROM A NEAR-NORMAL POPULATION WITH $\mu = 60$ AND $\sigma = 10$


Class Interval      f     r.c.f.      Class Interval      f     r.c.f.
 16.5 to 19.4       1     1.000       -4.5 to  -1.6      88      .342
 13.5 to 16.4       2      .998       -7.5 to  -4.6      33      .124
 10.5 to 13.4       3      .993      -10.5 to  -7.6      15      .042
  7.5 to 10.4      14      .985      -13.5 to -10.6       1      .005
  4.5 to  7.4      36      .950      -16.5 to -13.6       1      .002
  1.5 to  4.4      80      .861
 -1.5 to  1.4     129      .663       Total             403

Arithmetic mean of $d_i$ = +0.06; standard deviation = 4.43.

The frequency distribution in Table 6.51 displays one notable
contrast to that of the $\bar{x}$'s of Table 6.21, namely, the $d_i$ are more
variable. As a matter of fact, the standard deviation of the $d_i$ is
greater than that of the $\bar{x}_i$ by a factor of about 1.4 in this instance
in which n = 10. It can be shown mathematically that the factor
theoretically is $\sqrt{2}$, which = 1.414, approximately; hence the empirical
results of Table 6.51 agree quite well with the theory.

The following theorem summarizes some of the above information
and makes it more precise:

Theorem. If a very large number of pairs of independently drawn
samples of n observations is taken from a normal population with
standard deviation = $\sigma$, then:

(a) The population of differences $d_i = \bar{x}_{1i} - \bar{x}_{2i}$ will conform
to the normal distribution.

(b) The arithmetic mean of the population of $d_i$ is 0.

(c) The standard deviation of the population of $d_i$ is

$\sigma_d = \sigma\sqrt{2/n}$.

For the situation summarized in Table 6.51, $\sigma_d = 10\sqrt{2/10} = 4.47$,
an amount which agrees quite well with the 4.43 shown in that table
as the observed standard deviation for 403 d's.
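
Parts (b) and (c) of the theorem can be checked by a sampling experiment
similar in spirit to Table 6.51. The sketch below assumes an exactly
normal population with mean 60 and standard deviation 10; the number of
pairs, the seed, and the function name are arbitrary illustrative
choices.

```python
import math
import random
import statistics

def simulate_differences(mu=60.0, sigma=10.0, n=10, n_pairs=5000, seed=3):
    """Differences between the means of pairs of independent samples of size n."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_pairs):
        mean1 = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        mean2 = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        diffs.append(mean1 - mean2)
    return diffs

d = simulate_differences()
theoretical_sd = 10.0 * math.sqrt(2 / 10)  # sigma * sqrt(2/n)
observed_mean = statistics.fmean(d)
observed_sd = statistics.stdev(d)
print(f"theoretical sd = {theoretical_sd:.2f}")
print(f"observed mean = {observed_mean:.2f}, observed sd = {observed_sd:.2f}")
```

As with the 403 differences of the table, the observed values should lie
close to 0 and 4.47 without matching them exactly.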

In practice, the standard deviation, $\sigma$, nearly always is unknown
so that an estimate must be made from the sample. When a pair of
samples has been taken it has been determined by mathematical
analysis that the best procedure to follow is this: Lump together,
or pool, the sums of the squares of the deviations $x_i$ in each sample
taken separately and divide that sum by 2(n - 1) before taking the
square root. In symbols, the following is the recommended estimate of
$\sigma$:


(6.51)    $s = \sqrt{\dfrac{\sum(x_1^2) + \sum(x_2^2)}{2(n - 1)}}$,


where $\sum(x_1^2)$ = the sum of squares of the deviations of the X's of the
first sample from their mean; and likewise for $\sum(x_2^2)$.
When the theorem above is applied, we obtain the following formula
for the sampling estimate of $\sigma_d$:


(6.52)    $s_d = s\sqrt{2/n} = \sqrt{\dfrac{\sum(x_1^2) + \sum(x_2^2)}{n(n - 1)}}$


It turns out from mathematical analysis that the sampling ratios
$t_i = (d_i - \mu)/s_d$ follow the same sampling distribution as the t
previously discussed, if $\mu$ = the true average $d_i$; hence Table IV can
be used here provided we employ 2(n - 1) degrees of freedom for t
instead of the (n - 1) used before.

'The way is now open to solve the type of problem proposed at the 
beginning of this section. To illustrate, suppose that 20 steers of the 
same breed, weight, and previous history are divided into two equal 
lots by some impartial means such as drawing numbers from a hat. 
Thereafter, one group is fed a ration 50 per cent of which is peanut 
meal and 50 per cent is a standard ration. The other group of steers 
is fed only 20 per cent peanut meal, the remainder being the same 
standard ration. After an adequate period of time, the average daily 
gains of the steers were obtained as follows, with A standing for the 
group of steers whose diet contained 50 per cent of peanut meal: 


          Group A      Group B
          1.55 lb      1.66 lb
          1.68         1.82
          1.42         1.71
          1.45         1.78
          1.52         1.69
          1.58         1.73
          1.56         1.75
          1.61         1.61
          1.54         1.90
          1.48         1.72
Sum      15.39        17.37


$$\sum(x_1^2) = 0.0531, \qquad \sum(x_2^2) = 0.0608.$$

$$s_d = \sqrt{\frac{0.0531 + 0.0608}{10(10 - 1)}} = \sqrt{0.001266} = 0.036, \text{ approximately.}$$

$$t = (0.20 - \mu)/0.036 = 5.56 \text{ if } \mu = 0.$$

$t$ has 18 degrees of freedom.
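
The computation above can be reproduced directly from the steer data. The following is a sketch under the equal-sample-size assumption of formula 6.52; the helper names `sum_sq_dev` and `pooled_t` are illustrative and not from the text. Working with unrounded values gives a difference of 0.198 and a $t$ near 5.57, which agrees with the rounded hand computation.

```python
import math

# Average daily gains (lb) from the table of this section.
group_a = [1.55, 1.68, 1.42, 1.45, 1.52, 1.58, 1.56, 1.61, 1.54, 1.48]  # 50% peanut meal
group_b = [1.66, 1.82, 1.71, 1.78, 1.69, 1.73, 1.75, 1.61, 1.90, 1.72]  # 20% peanut meal

def sum_sq_dev(xs):
    """Sum of squared deviations of the observations from their own mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def pooled_t(a, b, mu=0.0):
    """Return (d, s_d, t, degrees of freedom) for two samples of equal size n,
    using s_d = sqrt((SS_a + SS_b) / (n (n - 1))) as in formula 6.52."""
    n = len(a)
    if len(b) != n:
        raise ValueError("this sketch assumes equal sample sizes")
    d = sum(b) / n - sum(a) / n
    s_d = math.sqrt((sum_sq_dev(a) + sum_sq_dev(b)) / (n * (n - 1)))
    return d, s_d, (d - mu) / s_d, 2 * (n - 1)

d, s_d, t, df = pooled_t(group_a, group_b)
print(round(d, 3), round(s_d, 3), round(t, 2), df)
```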




We learn from Table IV that less than one-half of one per cent
of the sample t's with 18 degrees of freedom are numerically as large
as 5.56; therefore, the hypothesis that $\mu = 0$ is rejected and the two
samples are regarded as having been drawn from different normal
populations of average daily gains. It is concluded that the steers
on a diet containing 50 per cent peanut meal will, on the average,
produce lower gains than those on only 20 per cent peanut meal.

Ordinarily the experimenter would wish to carry the statistical
analysis farther than this by means of confidence intervals. If the
steers on 20 per cent peanut meal do not gain enough more to pay
for the added expense of using more of the standard ration which
costs more than the peanut meal, it still may not pay to use the diet
B. If 95 per cent confidence limits are chosen here, they are
determined by the usual methods from

$$-2.10 < (0.20 - \mu)/0.036 < +2.10; \text{ or}$$
$$0.12 < \mu < 0.28.$$


Therefore, it can be concluded with considerable confidence (asso- 
ciated with odds of 19 to 1) that the average advantage due to feed- 
ing 20 per cent peanut meal instead of 50 per cent is at least 0.12 
pound of gain per day but not over 0.28 pound. Given the current 
price of steers of the sort under study, we can decide which ration 
is economically preferable. Obviously, other factors would be con- 
sidered in practice, but they are separate considerations. 
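
The confidence limits quoted above follow from solving the double inequality for $\mu$. A brief sketch of that step is given here; the function name `t_confidence_interval` is an assumption made for illustration, and the critical value 2.10 is simply the tabled value used in the text for 18 degrees of freedom.

```python
def t_confidence_interval(d, s_d, t_crit):
    """Limits d - t_crit * s_d and d + t_crit * s_d on the true mean difference."""
    half_width = t_crit * s_d
    return d - half_width, d + half_width

# Values used in the text: d = 0.20, s_d = 0.036, t = 2.10 (95 per cent, 18 D/F).
low, high = t_confidence_interval(0.20, 0.036, 2.10)
print(round(low, 2), round(high, 2))  # 0.12 0.28
```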

Although it seems preferable in studies such as those illustrated in
this section to have equally many observations in each group, this
is not always an attainable goal. If the sample sizes are unequal,
say $n_1$ and $n_2$ instead of $n$ each, the above methods are applicable
but the formulas are changed to fit these new circumstances.
Formula 6.51 is replaced by

(6.53) $$s' = \sqrt{\frac{\sum(x_1^2) + \sum(x_2^2)}{n_1 + n_2 - 2}};$$ and

formula 6.52 is replaced by

(6.54) $$s_d = s'\sqrt{1/n_1 + 1/n_2} = \sqrt{\frac{\sum(x_1^2) + \sum(x_2^2)}{n_1 + n_2 - 2}\,(1/n_1 + 1/n_2)}.$$

Formulas 6.51 and 6.53 are fundamentally the same in all important
respects; each is an estimate based on the deviations $(X_{1i} - \bar{x}_1)$ and
$(X_{2i} - \bar{x}_2)$ in both samples. Likewise, formulas 6.52 and 6.54 are
fundamentally alike; each comes from the theorem of mathematical
statistics that the variance of the difference between the means of pairs
of random samples is the sum of the variances of the two means
considered separately. The reader should verify the fact that if in
formulas 6.53 and 6.54 $n_1 = n_2 = n$, these formulas become 6.51 and
6.52, respectively.
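
The verification suggested to the reader can also be done numerically. The sketch below codes formulas 6.54 and 6.52 side by side and confirms that they agree when the two sample sizes are equal; the function names are illustrative assumptions, and the sums of squares are those of the steer example.

```python
import math

def s_d_unequal(ss1, ss2, n1, n2):
    """Formula 6.54: standard deviation of the mean difference, sizes n1 and n2."""
    s_pooled = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))  # formula 6.53
    return s_pooled * math.sqrt(1.0 / n1 + 1.0 / n2)

def s_d_equal(ss1, ss2, n):
    """Formula 6.52: the same quantity when both samples have n observations."""
    return math.sqrt((ss1 + ss2) / (n * (n - 1)))

# With n1 = n2 = 10 the two formulas give the same value, about 0.036.
print(round(s_d_unequal(0.0531, 0.0608, 10, 10), 3))
print(round(s_d_equal(0.0531, 0.0608, 10), 3))
```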

Many other applications of the t-distribution, and accompany- 
ing statistical techniques, could be cited; but the fundamental prin- 
ciples are essentially the same as those already explained. 


PROBLEMS 


1. Suppose that 5 experimental concrete cylinders of each of two types of
concrete have been tested for breaking strength, with the following results in 
hundreds of pounds per square inch: 


Type 1: 40, 50, 48, 46, and 41; and 
Type 2: 65, 57, 60, 70, and 55. 


Use the t-distribution to determine if the difference in average breaking strength 
between the two types of concrete can be assigned reasonably to mere sampling 
variation. 

2. Suppose that two groups of 10 steers have been fed two different rations
(one to each group) and that the steers are of the same age, breed, and initial
weight. Given the following computations determine the 99 per cent confidence
interval on the true difference between the means of the average daily gains
under the two rations:

        Ration A                    Ration B
        n = 10                      n = 10
        $\bar{x}_1$ = 1.90 lb/day   $\bar{x}_2$ = 1.55 lb/day

        s = 0.20, 18 D/F; t = 3.92

Ans. $CI_{99}$: $0.1 \le |\mu_1 - \mu_2| \le 0.6$ lb/day.


3. Suppose that an experiment has been set up at an engineering laboratory
to determine the difference in average breaking load between oak and fir beams
of the dimensions: 2 inches x 2 inches x 28 feet. The data from tests on 10
beams of each wood are as follows, in pounds:

Oak: 725, 1015, 1750, 1210, 1435, 1175, 1320, 1385, 1505, and 1340.
     Sum = 12,860; $\sum X^2$ = 17,243,550.
Fir: 1205, 810, 1110, 530, 765, 1075, 1475, 950, 1020, and 1070.
     Sum = 10,010; $\sum X^2$ = 10,625,400.

If you can afford a risk of an error of only 1 in 100 what confidence limits do you
set on the true difference in average breaking load for these two materials?

4. Draw 5 pairs of samples, each with n = 10, from the laboratory population
furnished you by the instructor, and compute $t = d/s_d$ for each pair of samples.
Then obtain from Table IV the probability that a numerical value of t that size
or larger would be obtained while pairs of samples are drawn from the same normal
population.

5. An experiment designed to find out if supplemental lighting with
incandescent lights will increase the vitamin C content of greenhouse tomatoes
produced the following results in milligrams per 100 grams for tomatoes on the
bottom two clusters of the plants:

No extra light: 25.92, 28.08, 21.27, 22.53, 26.27, 22.57, 22.57, 30.19, and 20.35.
$\sum X$ = 219.75, $\sum X^2$ = 5454.8279.

Incandescent: 20.30, 29.21, 20.50, 21.50, 23.71, 29.34, 26.32, 15.55, and 29.56.
$\sum X$ = 215.98, $\sum X^2$ = 5378.5612.


Use the t-test to decide whether or not the incandescent lights changed the 
average ascorbic acid concentration in the greenhouse tomatoes. 

6. Given the following two sets of simulated data, assume first that the
observations are paired (vertically) and compute and interpret the t. Then assume
that the observations are not paired and again do a t-test. Compare these
results and the hypotheses tested.

A: 85 72 28 59 75 46 39 68 53; $\sum X$ = 525. $\sum X^2$ = 33,369.
B: 80 65 24 58 65 40 38 60 42; $\sum X$ = 472. $\sum X^2$ = 27,198.

Ans. (a) t = 4.97, 8 D/F, reject $H_0(\mu = 0)$.
(b) t = 0.78, 18 D/F, accept $H_0(\mu_A = \mu_B)$.

7. The antibiotic, aureomycin, has been found to be a growth stimulant for
certain animals. The discovery is illustrated by the following two sets of data
obtained at Kansas State College under the direction of Dr. E. E. Bartley of
the Department of Dairy Husbandry. The measurement of growth used is the
total gain during a 12-week period, expressed as a percentage of birth weight.

        No Aureomycin               Had Aureomycin
            77.6                        125.6
            81.3                        135.5
           109.2                        122.9
           124.1                        144.8
           101.4                        103.3
           106.0                        142.9
            81.7
            70.6                    $\sum X_2$ = 775.0
            54.8                    $\bar{x}_2$ = 129.17
            43.3                    $\sum(X_2^2)$ = 101,298.36
           119.2
           100.0

        $\sum X_1$ = 1069.2
        $\bar{x}_1$ = 89.10
        $\sum(x_1^2)$ = 6954.96

Obtain the 95 per cent confidence interval on the true difference between the
two means $\mu_1$ and $\mu_2$ and tell what information this interval makes available.

8. Suppose that 31 rainbow trout and 31 brook trout are taken at random
from a mountain stream and are measured for length. The rainbows averaged
9.2 inches, with s = 2 inches; the brook trout averaged 8.7 inches, with standard
deviation = 2.1 inches. Test the hypothesis $H_0(\mu_1 = \mu_2)$ and draw appropriate
conclusions.

9. Suppose that from a certain study, $\bar{x}_1$ = 32.7 and $\bar{x}_2$ = 35.9, and the pooled
estimate of $\sigma$ is s = 7.5. Both samples contained 12 observations. Test
$H_0(\mu_1 = \mu_2)$.

10. Suppose that 15 samples of each of two varieties of tomatoes have been
analyzed for vitamin C, with these results:

        Variety 1                   Variety 2
        $\bar{x}_1$ = 28.5          $\bar{x}_2$ = 30.4
        $\sum(x_1^2)$ = 50          $\sum(x_2^2)$ = 60


Test the hypothesis that the true average ascorbic acid concentration in these 
two varieties is the same. 


6.6 USE OF THE SAMPLE RANGE INSTEAD OF THE 
STANDARD DEVIATION IN CERTAIN TESTS OF 
STATISTICAL HYPOTHESES 


The most difficult computational part of the t-test is the
determination of either $s_{\bar{x}}$ or $s_d$, as the case may be. Another method of testing
hypotheses can be used in some situations without the need to
compute these standard deviations at all. It uses the sample range as its
measure of variation. The loss of precision is not serious for small
samples, becomes greater as the size of the sample is increased, and
renders the method useless for large samples. The trouble is that the
sampling variability of the range is almost as low as that of the
standard deviation for small samples but increases quite rapidly with n.
The ratio

(6.61) $$G = (\bar{x} - \mu)/R,$$

where R = sample range, can be used in a manner analogous to the
t-test procedure. When G has been calculated, Table IX gives the
probability that such a sampling $|G|$ will occur by chance for samples
of size n if the hypothesis regarding $\mu$ is exactly right. Thereafter the
reasoning is just as it was in section 6.4.

When two random samples, each of size n, have been drawn from
what is assumed to be the same normal population, the ratio

(6.62) $$G = \frac{\bar{x}_1 - \bar{x}_2}{\text{mean range}},$$

where mean range = arithmetic mean of the ranges of the two samples,
can be used on problems like those in section 6.5. Table X now is used
instead of Table IX. Again the tables give $P(|G| > G_0)$, where $G_0$
is the observed numerical size of G.

To illustrate the application of formulas 6.61 and 6.62 reference is
made to the problems solved in sections 6.4 and 6.5. First consider
problem 6.41. The sample mean is $\bar{x}$ = 5.65 and the range is 15;
therefore, for $H_0(\mu = 0)$: G = (5.65 - 0)/15 = 0.377, with the
sample size = n = 20. By Table IX, the probability that G would be so
large if $\mu$ actually were zero is much less than .001; hence the hypothesis
that $\mu = 0$ is rejected decisively, as it was from the t-test.

The next example is from section 6.5 and involves two diets fed to
steers. In fact, $\bar{x}_1$ = 1.54 and $\bar{x}_2$ = 1.74, $R_1$ = 0.26, $R_2$ = 0.29, and
hence the average sample range = 0.275. Then

        G = 0.20/0.275 = 0.727, with each n = 10.

By Table X a G larger than 0.727 would occur by chance less than
0.1 per cent of the time if both samples were from the same normal
population. The hypothesis is rejected; that is, the second diet,
which produced the higher average gain in the sampling, is considered
to produce higher average gains than the first diet.
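
Both forms of the G ratio are simple enough to compute in a few lines. The sketch below applies formula 6.61 to the summary figures quoted for the one-sample example and formula 6.62 to the steer data of section 6.5; the function names are illustrative assumptions. With unrounded means the two-sample G comes out near 0.72, slightly below the 0.727 obtained from the rounded difference 0.20. The probabilities themselves still have to be read from Tables IX and X, which are not reproduced here.

```python
def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

def g_one_sample(x_bar, mu, r):
    """Formula 6.61: G = (x_bar - mu) / R."""
    return (x_bar - mu) / r

def g_two_sample(a, b):
    """Formula 6.62: difference of the sample means over the mean of the two ranges."""
    mean_range = (sample_range(a) + sample_range(b)) / 2.0
    return (sum(b) / len(b) - sum(a) / len(a)) / mean_range

group_a = [1.55, 1.68, 1.42, 1.45, 1.52, 1.58, 1.56, 1.61, 1.54, 1.48]
group_b = [1.66, 1.82, 1.71, 1.78, 1.69, 1.73, 1.75, 1.61, 1.90, 1.72]

print(round(g_one_sample(5.65, 0.0, 15.0), 3))   # 0.377
print(round(g_two_sample(group_a, group_b), 2))  # 0.72
```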

Given the tables and formulas above, we can derive confidence
intervals on $\mu$ as before when n is small. This interval would not
be expected to be identical with one obtained from the t-distribution
for the same confidence coefficient; but it has been shown that, on
the average, the two intervals are very close to the same length as
long as n is small. (Specifically, K. S. C. Pillai has shown in the
September, 1951, number of the Annals of Mathematical Statistics
that the ratio of the average lengths of the $CI_{95}$'s by the two
methods still is 0.97 when n = 20.) To illustrate, consider again the
problem of section 6.4 just used above to illustrate the G-test when
there is one set of n observations. In this problem the two
confidence intervals are obtained as follows:

$$-2.1 < \frac{5.65 - \mu}{0.87} < +2.1,$$

and the 95 per cent confidence interval is

$$3.82 < \mu < 7.48.$$

Using the ratio G, we have

$$-0.126 < \frac{5.65 - \mu}{15} < +0.126,$$

and hence the 95 per cent confidence interval is

$$3.76 < \mu < 7.54,$$

which is very much like that derived from the t-distribution.
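
The two intervals can be set side by side numerically. The sketch below uses only the figures quoted in the text (mean 5.65, standard error 0.87 with multiplier 2.1, range 15 with multiplier 0.126); the helper name `interval` is an illustrative assumption. For this one example the ratio of the two lengths is about 0.97.

```python
def interval(center, half_width):
    """Symmetric interval (center - half_width, center + half_width)."""
    return center - half_width, center + half_width

t_low, t_high = interval(5.65, 2.1 * 0.87)    # from the t-distribution
g_low, g_high = interval(5.65, 0.126 * 15.0)  # from the G ratio

length_ratio = (t_high - t_low) / (g_high - g_low)
print(round(t_low, 2), round(t_high, 2))  # 3.82 7.48
print(round(g_low, 2), round(g_high, 2))  # 3.76 7.54
print(round(length_ratio, 2))             # 0.97
```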


PROBLEMS 


1. Solve problem 1, section 6.4, with the G-test instead of the t-test. 

2. Solve problem 3, section 6.5, by means of the G-test. 

Ans. G = 0.289; P ≈ .064; accept $H_0$ tentatively.

3. Draw 25 samples, each with n = 10 from a near-normal population, and 
compute the G for each sample. How many of these G's fall beyond 0.186 in 
numerical size? How do your results check with Table IX? 

4. Suppose that a college is attempting to learn if instruction of a certain 
type improved in one year a student's ability to think analytically. Also as- 
sume that tests exist which reliably measure such ability, and that these tests 
are given at the beginning and at the end of the school year. If the following 
differences between the last and the first score of each student were obtained, 
would the G-test cause you to accept or to reject the hypothesis that the teach- 
ing procedures employed failed to improve analytical thinking? 


X: 5, 0, 10, -4, -6, 8, 1, 7, -10, 0, 3, 5, -1, 8, 4, 0, -3, 7, 7, and 9.

Ans. G = 0.125; P = .05; reject $H_0(\mu = 0)$.

5. Make up, and solve, a problem like problem 4, which has the same $\bar{x}$ but for
which G is twice as large. Half as large.

6. Suppose that information is sought analogous to that in problem 4, but
there are two separate classes of 15 students being taught by each method. The
two classes are supposed to be equal at the start of the teaching period. Given
the following gains (+) or losses (-) in score during the year, draw
appropriate conclusions by means of the G-test and Table X:

Method I: 10, 3, -2, 5, 0, -8, 14, 1, -12, 5, 5, 9, 7, -1, and 9.
Method II: -2, 5, 5, 4, 0, 7, 6, -1, 4, 10, 8, 11, 10, 0, and 13.

Ans. G = 0.114; P > .10; accept $H_0(\mu_1 = \mu_2)$.

7. An experiment intended to discover if blue fluorescent lights will increase 
the vitamin C concentration in tomatoes on the seventh and eighth clusters 
from the bottom of the plant gave these results, in milligrams per 100 grams: 


No extra light: 38.57, 39.39, 33.44, 34.32, and 38.01.

Blue fluorescent: 33.72, 37.85, 39.07, 31.16, and 35.69. 
Test the hypothesis that the blue light does not change the vitamin C con- 
centration, and draw valid conclusions. 

8. Suppose that two methods of computing basal metabolism for the same
11 subjects produced the following pairs of records, in calories per square meter
per hour.

I: 31.42, 30.90, 34.92, 30.59, 30.53, 33.08, 32.61, 30.46, 30.55, 33.19, and 29.22.
   $\sum X_1$ = 347.47.
II: 30.73, 31.44, 32.82, 31.80, 29.16, 32.96, 32.32, 30.76, 27.65, 32.54, and 29.30.
   $\sum X_2$ = 341.48.

Use the G-test to decide if one method tends to produce higher metabolism
records than the other, and explain your decision in terms of sampling
phenomena. Ans. G = 0.100; P > .10; accept $H_0(\mu_1 = \mu_2)$.

9. Some varieties of wheat produce flour which typically takes longer to mix 
into proper doughs than others. Decide by the G-test if Kharkof actually has 
(as appears from the samples) a longer mixing time than Blackhull: 


Kharkof: 3.00, 1.88, 1.62, 1.50, 1.75, 1.38, 1.12, 1.88, 2.50, 1.62, 2.88, 2.50, 3.88,
and 2.75. Mean = 2.16.

Blackhull: 1.25, 2.38, 1.62, 1.50, 1.25, 1.38, 2.25, 2.12, 1.84, 2.38, 2.25, 1.50, 2.00,
and 1.62. Mean = 1.84.

10. Compute the 90 per cent confidence intervals for the two varieties of
problem 9 and compare them. Draw appropriate conclusions.
Ans. $CI_{90}$: $-0.04 \le |\mu_1 - \mu_2| \le +0.69$.


6.7 THE CENTRAL LIMIT THEOREM AND NON- 
NORMAL POPULATIONS 


The statistical methods which have been discussed in this chap- 
ter are based on the assumption that the populations involved are 
normal. In practice this requirement rarely is met rigorously; hence 
we may wonder if the subject matter of this chapter is chiefly of 
academic interest because it does not fit actual conditions. This is 
not the situation because of the truth of the central limit theorem. 

This theorem states essentially that if any population of numerical
measurements has a finite mean and variance, $\mu$ and $\sigma^2$, respectively,
the frequency distribution of the sampling mean, $\bar{x}$, will be essentially
a normal distribution with mean = $\mu$ and variance = $\sigma^2/n$ if the n is
very large. As a matter of fact, the necessary size of n depends on the
degree of non-normality of the original population. Tables 6.71A, B,
C, and D summarize a decidedly non-normal population of counts of
flies on dairy cattle, and show some observed distributions of $\bar{x}$'s for
samples with n = 9, 16, and 25. Figure 6.71 displays these same
distributions visually. It is rather obvious that none of these sample
sizes is very large, and therefore the distributions of $\bar{x}$ are still
noticeably non-normal. However, the meaning of the central limit theorem
is illustrated.

It can be seen from Tables 6.71 and from Figure 6.71:

(a) That the parent population is extremely non-normal.

(b) That even with only nine observations per sample, the
distribution of $\bar{x}$ has gone a long step towards fulfilling the ideal expressed
by the Central Limit Theorem.

(c) As n was increased, the distribution of $\bar{x}$, and its mean and
variance, approached more and more closely to those features which the
Central Limit Theorem assures us will be attained if n is sufficiently
large.
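
The variances which the Central Limit Theorem predicts for the sample means can be computed from the population variance alone and set against the observed variances reported in Tables 6.71B, C, and D. The following is only a sketch of that comparison; the function name is an illustrative assumption, and the numbers are those printed with the tables.

```python
def clt_variance_of_mean(sigma_sq, n):
    """Variance of the sample mean predicted by the Central Limit Theorem."""
    return sigma_sq / n

population_variance = 257.28                     # Table 6.71A
observed = {9: 26.72, 16: 17.32, 25: 10.07}      # Tables 6.71B, C, and D

predicted = {n: clt_variance_of_mean(population_variance, n) for n in observed}
for n in sorted(observed):
    # Columns: n, predicted variance, observed variance, observed / predicted.
    print(n, round(predicted[n], 2), observed[n], round(observed[n] / predicted[n], 2))
```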


The foregoing discussions are not intended to make us ignore the 
non-normality of distributions met in practice, but they do indicate 
that a great many moderately non-normal distributions can be studied 
by means of the techniques explained in this chapter. 

In this chapter the ratio $(\bar{x} - \mu)/s_{\bar{x}}$ was said to follow the
t-distribution with the same number of degrees of freedom that $s_{\bar{x}}$ has as
a sampling estimate of $\sigma_{\bar{x}}$. Actually any ratio $(w - \mu)/s$ will follow


TABLE 6.71A

SUMMARY OF COUNTS OF FLIES ON DAIRY CATTLE TETHERED IN A FIELD AT
KANSAS STATE COLLEGE AFTER THEY WERE SPRAYED WITH AN EFFECTIVE
FLY REPELLENT

                                     Normal r.c.f.,
Class Interval      f     r.c.f.     Same mu and sigma    Difference
   168-175          1     1.000          1.000               0
   160-167          0     1.000          1.000               0
   152-159          1     1.000          1.000               0
   144-151          0      .999          1.000              -.001
   136-143          0      .999          1.000              -.001
   128-135          0      .999          1.000              -.001
   120-127          1      .999          1.000              -.001
   112-119          3      .999          1.000              -.001
   104-111          6      .998          1.000              -.002
    96-103          2      .995          1.000              -.005
    88- 95          7      .994          1.000              -.006
    80- 87          8      .992          1.000              -.008
    72- 79         10      .988          1.000              -.012
    64- 71         11      .984           .999              -.015
    56- 63         31      .980           .999              -.019
    48- 55         26      .968           .994              -.026
    40- 47         59      .957           .976              -.019
    32- 39        102      .934           .932              +.002
    24- 31        206      .893           .837              +.056
    16- 23        392      .812           .687              +.125
     8- 15        771      .656           .500              +.156
     0-  7        869      .348           .309              +.039

$\sum(f)$ = 2506

$\mu$ = 15.87, $\sigma^2$ = 257.28


TABLE 6.71B

DISTRIBUTION OF MEANS OF RANDOM SAMPLES WITH n = 9 DRAWN FROM
THE POPULATION OF TABLE 6.71A

                                     Normal r.c.f.,
x-bar Interval      f     r.c.f.     Same mu and sigma    Difference
 36-38.99...        3     1.000          1.000               0
 33-35.99...        8      .997          1.000              -.003
 30-32.99...        4      .990          1.000              -.010
 27-29.99...       11      .986           .998              -.012
 24-26.99...       30      .975           .988              -.013
 21-23.99...       73      .947           .954              -.007
 18-20.99...      139      .877           .866              +.011
 15-17.99...      221      .744           .701              +.043
 12-14.99...      268      .533           .480              +.053
  9-11.99...      211      .278           .266              +.012
  6- 8.99...       74      .076           .113              -.037
  3- 5.99...        6      .006           .037              -.031

$\sum(f)$ = 1048

$\mu_{\bar{x}}$ = 15.27, $\sigma_{\bar{x}}^2$ = 26.72
By Central Limit Theorem (if n is large enough) $\mu_{\bar{x}}$ = 15.37,
$\sigma_{\bar{x}}^2$ = 28.59


TABLE 6.71C

DISTRIBUTION OF MEANS OF RANDOM SAMPLES WITH n = 16 DRAWN FROM
THE POPULATION OF TABLE 6.71A

                                     Normal r.c.f.,
x-bar Interval      f     r.c.f.     Same mu and sigma    Difference
 32-33.99...        2     1.000          1.000               0
 30-31.99...        1      .998          1.000              -.002
 28-29.99...        6      .997          1.000              -.003
 26-27.99...       13      .991           .999              -.008
 24-25.99...       13      .979           .994              -.015
 22-23.99...       40      .966           .978              -.012
 20-21.99...       74      .928           .938              -.010
 18-19.99...      106      .856           .855              +.001
 16-17.99...      162      .754           .719              +.035
 14-15.99...      209      .597           .540              +.057
 12-13.99...      206      .395           .352              +.043
 10-11.99...      145      .196           .195              +.001
  8- 9.99...       51      .056           .091              -.035
  6- 7.99...        7      .007           .035              -.028

$\sum(f)$ = 1035

$\mu_{\bar{x}}$ = 15.56, $\sigma_{\bar{x}}^2$ = 17.32
By Central Limit Theorem (if n is large enough) $\mu_{\bar{x}}$ = 15.87,
$\sigma_{\bar{x}}^2$ = 16.08

TABLE 6.71D

DISTRIBUTION OF MEANS OF RANDOM SAMPLES WITH n = 25 DRAWN FROM
THE POPULATION OF TABLE 6.71A

                                     Normal r.c.f.,
x-bar Interval      f     r.c.f.     Same mu and sigma    Difference
 26.00-27.49        3     1.000          1.000               0
 24.50-25.99        3      .997          1.000              -.003
 23.00-24.49       16      .994           .997              -.003
 21.50-22.99       22      .978           .990              -.012
 20.00-21.49       43      .956           .968              -.012
 18.50-19.99       85      .914           .917              -.003
 17.00-18.49      121      .830           .821              +.009
 15.50-16.99      180      .710           .673              +.037
 14.00-15.49      188      .532           .492              +.040
 12.50-13.99      168      .346           .313              +.033
 11.00-12.49      133      .179           .169              +.010
  9.50-10.99       41      .048           .077              -.029
  8.00- 9.49        5      .007           .029              -.022
  6.50- 7.99        2      .002           .009              -.007

$\sum(f)$ = 1010

$\mu_{\bar{x}}$ = 15.51, $\sigma_{\bar{x}}^2$ = 10.07
By Central Limit Theorem (if n is large enough) $\mu_{\bar{x}}$ = 15.37,
$\sigma_{\bar{x}}^2$ = 10.29


the t-distribution as long as w is normally distributed with mean $\mu$
and s is calculated as described earlier. Hence if w is a sample mean
drawn from a non-normal population which satisfies the few
requirements of the Central Limit Theorem, and if n is large enough, the
ratio $(w - \mu)/s$ can be considered quite accurately to follow a
t-distribution. Thereafter the methods introduced in this chapter for
estimating parameters and for testing hypotheses regarding parameters
become applicable.

One word of warning is in order, however, before this subject is left.
In any particular sampling situation, the standard deviation, $\sigma_{\bar{x}}$, needs
to be estimated from the sample. This is done by means of $s_{\bar{x}}$. What
happens to the quality of this estimate when the parent population is
radically non-normal? Under such circumstances the beginner is
advised to seek the advice of a statistician.

REVIEW PROBLEMS


1. Who was Student, and how was his work connected with the development of 
present-day methods of statistical analysis? 

2. Calculate the arithmetic mean and the standard deviation for a set of
numbers, Y, given that $\sum Y$ = 900 grams, and $\sum Y^2$ = 55,465 grams$^2$, where the Y's
are the weights of female rats at 28 days of age. There are 20 rats in the sample.

Ans. $\bar{y}$ = 45 grams. s = 28.1 grams.

3. Compute the 80 per cent confidence interval for problem 2 on the true 
mean 28-day weight of such rats, and draw appropriate conclusions. 

4. What would be the general change in the confidence interval of problem 3
if 95 per cent limits instead of 80 per cent limits had been computed? What
would be the effect if the $\sum Y^2$ had been smaller, the remainder of the numbers
staying the same?

5. Graph the binomial frequency distribution of the numbers of sums of 6
thrown with two unbiased dice on sets of 8 throws.

6. Compute for problem 5 the probability that on any particular future set
of 8 throws at least 3 sums of 6 will be thrown. Ans. .087.

7. Take any newspaper which lists prices of bonds and determine the median 
price and also the range. 

8. Calculate the coefficient of variation for problem 2, using $\bar{y}$ in place of $\mu$ and
s in place of $\sigma$, and tell what sort of information it provides about the weights of
the rats in the sample. Ans. CV = 62.4 per cent.

9. Draw 10 samples of 12 members each from the laboratory population and
compute t and G for each sample, using the correct hypotheses regarding $\mu$.

10. Determine the upper limits of the 20th and 85th percentiles for the fre- 
quency distribution of Table 6.31 and state what information they give. 

Ans. Upper limit of 20th percentile = -0.90 by interpolation, = -0.86 by
Figure 6.31. Upper limit of 85th percentile = 1.12 by interpolation, = 1.10 by
Figure 6.31.

11. If 100 t's were to be drawn at random from among those summarized in
Table 6.31, what is the expected number of them falling between t = 0 and
t = 1.50?


12. Following are some experimental results from tests of the breaking 
strengths of the wet warp of rayon and wool fabrics in pounds: 


Rayon: 29.5, 31.0, 28.7, 29.1, 28.4, 28.9, 30.9, and 29.0. 
Wool: 25.3, 28.9, 19.2, 25.1, 21.1, 31.4, 25.6, and 19.0. 


Does the difference in average breaking strength lie beyond the bounds of
reasonable sampling variation according to the t-test? Solve the problem also by
the G-test, and compare the conclusions.
Wool: $\sum X$ = 195.6, $\sum X^2$ = 4921.48. Rayon: $\sum X$ = 235.5, $\sum X^2$ = 6939.33.
Ans. t = 3.10, 14 D/F, P = .008; reject $H_0(\mu_1 = \mu_2)$. G = 0.665, n = 8,
P = .002; reject $H_0(\mu_1 = \mu_2)$.

13. Suppose that twelve 2 inch x 12 inch x 8 inch wood blocks were tested
for strength with the following results in thousands of pounds: 6.5, 17.0,
10.0, 15.1, 13.5, 16.4, 19.8, 7.7, 11.5, 14.5, 12.7, and 12.9. Place 95 per cent
confidence limits on the true average strength of such blocks, and interpret these
limits.

14. You are given the following hypothetical data from an experimental study
of the average daily gains (in pounds) of two groups of 10 steers each:

For group A: $\bar{x}_A$ = 2.35, $\sum x_A^2$ = 12.
For group B: $\bar{x}_B$ = 1.75, $\sum x_B^2$ = 180.

Is the difference in mean average daily gain, d = 0.60 pound, beyond the bounds
of reasonable sampling variation; that is, is it statistically significant?
Ans. t = 0.34; 18 D/F, P = .63; accept $H_0(\mu_1 = \mu_2)$.

15. Suppose that you have taken the bid in a bridge game and that you and 
your partner have all the trumps but J, 10, 7, 4, and 3. Before you have led 
at all, what do you compute as the probability that you would get all the 
trumps out within 3 leads? 

16. Suppose that you have been told that when 6 unbiased coins were tossed
at least 3 of them showed heads. What is the probability that exactly 4 of the
coins turned up heads? Ans. P(r = 4 heads) = 15/42.

17. Suppose that a large jug contains the following numbers of each de- 
nomination of paper currency, and that you are to withdraw a bill without look- 
ing and keep it: 50 one-dollar bills, 25 five-dollar bills, 10 ten-dollar bills, 5 
twenty-dollar bills, 2 fifty-dollar bills, and 1 one-hundred-dollar bill. What is 
your mathematical expectation on such a game? 

18. If 2 cards are drawn simultaneously from a bridge deck, what is the 
probability that one will be a spade, the other a heart? Ans. 13/102. 

CHAPTER 7


Linear Regression 
and Correlation 


It often is advantageous to consider two types of numerical
measurements simultaneously because they are related to each other. For
example, the following table records the mean monthly temperatures
from January to July at Topeka, Kansas, along with the month of
the year:

Month of the Year:                       Jan.  Feb.  Mar.  Apr.  May   June  July
Mean Temperatures (degrees Fahrenheit):  38.0  41.7  54.0  66.0  74.4  83.8  88.7

If the month to which each temperature applies were to be ignored,
these temperatures simply would be seven numbers which might fall
in the following random order (obtained by drawing them at
random): 88.7, 54.0, 66.0, 38.0, 83.8, 74.4, and 41.7. In this form the
numbers seem to be quite variable about their arithmetic mean,
63.8°F. However, when considered in conjunction with the month
as the second variable, these temperatures follow an orderly pattern.
This point is illustrated graphically in Figures 7.01A and B, in which
temperatures are first plotted against the random order in which they
were drawn, and then against the month to which they apply.

Figure 7.01A merely re-emphasizes the remarks made above about
the excessive variability about the mean, 63.8, and suggests that such
an average would be of doubtful utility because the temperatures are
too inconsistent. But it appears from Figure 7.01B that the mean
temperatures for the first seven months of the year increased in quite
an orderly manner from month to month, with little deviation from a
linear upward trend. Hence, a better analysis of these data can
be obtained by taking proper account of the second variable, time.
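
The contrast between the two views of the data can be seen numerically as well as graphically. The sketch below computes the arithmetic mean, the deviations from it, and the month-to-month changes for the Topeka temperatures given above; the variable names are illustrative assumptions. The deviations from the mean are large, whereas every month-to-month change is an increase.

```python
months = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June", "July"]
temps = [38.0, 41.7, 54.0, 66.0, 74.4, 83.8, 88.7]  # degrees Fahrenheit

mean_temp = sum(temps) / len(temps)

# Deviations from the mean: the vertical distances from the horizontal line.
deviations = [round(t - mean_temp, 1) for t in temps]

# Month-to-month changes: the orderly upward trend seen when time is considered.
changes = [round(later - earlier, 1) for earlier, later in zip(temps, temps[1:])]

print(round(mean_temp, 1))  # 63.8
for month, dev in zip(months, deviations):
    print(month, dev)
print(changes)
```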

A straight line is drawn into Figure 7.01A, 63.8 units above the
horizontal axis, to represent the arithmetic mean of the
temperatures whose individual magnitudes are indicated by the ordinates of
the points on the graph. The amounts by which the monthly mean
temperatures are above or below the mean of all the temperatures
are shown as vertical distances above or below the horizontal line.
As far as Figure 7.01A is concerned these deviations are simply the
consequences of unexplained variations in temperature. However,
when each temperature is associated with the month to which it
belongs (as in Figure 7.01B) it is apparent that all but a small
amount of the variability among these temperatures is associated
with a definite tendency to increase rather uniformly with
succeeding months of the season.

Figure 7.01. Mean monthly temperatures plotted at random and then
according to the corresponding month.

The trend line drawn into Figure 7.01B was determined just "by
eye"; but it usually is preferable to have a standard method of
determining where the line should be drawn. This matter will be
discussed in the following four sections.


7.1 SCATTER DIAGRAMS AND TYPES OF TREND LINES 


A number of the statistical methods with which the reader is
already familiar can be employed in the analysis of data involving two
variables. One additional matter must be studied, however, namely,
the relationship between the two variables. A little graphic analysis
usually is worth while before the numerical analyses are undertaken.

There are many ways in which one variable, Y, can change with 
respect to another variable, X, as successive pairs of observations 
are taken with the X, say, increasing in magnitude. The size of Y 
may tend to increase as X increases; Y may tend to decrease as X 
increases; or some of both may occur over the range of values studied. 
In addition there are numerous ways in which Y can increase as X 
increases; and similarly for the other possibilities just mentioned. 


To illustrate, consider the following tables of pairs of values for X 
and Y: 


(A) (B) (C) (D) (E) (F) (G) 
AR Xr | xe E 


"xc. ccr du Mmi We T un, gp cM n 


E BO; 395 ^. DR ^ =a, oe 0 40 
? 12 230° MO. EOS " CS 5. : a "i 25 
5 14. 3 94 3 99 1.0 1.20 —1 0.5 3 322 2.0 44 
7 0 4I 24 io 005 4 600 252 
0 92  5' 9 3522 20 380 107 6 i272 30 35 
11 30 2.5 575 2 48 3.5 28 
3.0 10.40 3 89 40 47 

4.5 26 

5.0 34 


It is helpful to a mathematical study of the relationship between
two variables if the pairs of corresponding numerical measurements,
X and Y, are represented by points on a graph, as they were in
elementary algebra. This has been done in Figures 7.11A, B, ..., G
for the data immediately above.

Such graphs are called scatter diagrams. It is noted from these
figures that the points may not exactly fit any simple curve, but they
sometimes do exhibit a general pattern which may make it possible to
study the relationship between Y and X. It is necessary here to
think in terms of general rather than precise curves as was done in
algebra, where all points which belonged with a certain graph fell
exactly on that graph. Data to which statistical analysis is applied
are not so well behaved as that. It will be necessary later to learn
how to decide which curve to choose as best describing the
relationship between X and Y suggested by a scatter diagram; and it will not
be expected that all the points will fall perfectly on the line finally
chosen.


The following information can be derived from a careful inspection 
of Figures 7.11: 


From (A): Y definitely tends to increase uniformly (linearly) as 
X increases. On the average, Y increases about 13/6 units for each 
unit increase in X. 

From (B): Y decreases in proportion to the increase in X. Again
the relationship can be briefly described as linear. More specifically,
Y tends to decrease about 10 units for each unit increase in X. As
a result the slope of the straight line which indicates the linear trend
is said to be -10.

From (C): Y has no apparent relation to X; hence the X measurement
may as well be ignored in the statistical analysis of the
measurements, Y.


From (D): Y increases with X but the increase is not uniform. In
fact, Y increases more rapidly for large X's than for the smaller X's.
This relationship between Y and X is called curvilinear. In this
instance, it follows the non-linear mathematical law: $Y = 0.5e^{X}$,
where e is the base for natural logarithms.

From (E): As the measurement represented by X increases from
-3 toward 0, the corresponding measurement, Y, tends to decrease
in a non-linear manner. Thereafter, Y increases non-uniformly. As
a matter of fact, the points on this scatter diagram tend to follow
the curve $Y = X^2$.

From (F): Y tends to increase non-uniformly with X, as in (D),
but the curve rises more sharply here.

From (G): There is no apparent relationship between X and Y,
as in (C).


Another point should be noted regarding the scatter diagrams of
Figures 7.11. If the concomitant measurement, X, were to be ignored
during an analysis of the data of Y corresponding to any of the
situations except (C) and (G), a considerable portion of the variability
of the Y's about their means would represent unnecessary variation
in this sense: we know from (B), say, that if X = 1, the
corresponding Y is necessarily about 40 units larger than if X = 5 because
there is definitely a downward trend of Y as X increases. If the X
were ignored, all that observed difference of 40 units must be
assigned to errors of measurement and/or to sampling accidents when,
in fact, only about one unit should be so assigned. Methods will be
presented later on in this chapter by which the apparent variation
among the Y's can be reduced by taking account of the statistical
relation between X and Y. However, nothing extensive will be done
with non-linear trends.
PROBLEMS 
1. Construct a scatter diagram for the following pairs of measurements, and 

draw in by eye a straight line which seems to you to best depict the trend of 
Y with X. Is the assumption that Y and X are linearly related a good one in 
your opinion? 

X: 2 4 6 8 10 12 14 16 18 20 

Y: 100 140 200 235 280 325 370 415 450 500 


2. From the following sampling data estimate how much Y changes, on the
average, for each unit increase in X.

X:   36   43   50   40   42   45   40   45   39   48
Y: 1.35 1.70 1.90 1.55 1.65 1.80 1.63 1.75 1.60 1.03

Ans. About 0.04.
3. Make a scatter diagram of the following pairs of observations and draw
in a freehand line to summarize the way Y changes with X:

X, Y: 1,1; 2,2.5; 3,4.5; 4,6.0; 5,10.0; 6,14.5; 7,23.0; 8,35.0.

4. Would you approve of the assumption that the carotene and the
nitrogen-free extract contents of pasture grasses are linearly related if
assured that the following pairs of such values form a representative
sample for pasture grasses of a given sort? Justify your decision.

X (NFE):      50  48  53  51  49  53  51  48
Y (Carotene): .44 .26 .20 .24 .44 .33 .20 .34

5. The following are means and corresponding standard deviations obtained
from samples of 10 observations, each drawn from an approximately normal
population. Construct a scatter diagram and decide what, if any,
relationship exists between the sampling mean and standard deviation from
a normal population if these samples are representative of such
populations. Plot $\bar{x}$ on the horizontal axis.

Mean    s      Mean    s      Mean    s      Mean    s
58.4   6.04    53.4  10.33    63.0   9.33    60.4  11.40
55.7   9.61    60.9  12.64    52.9   6.13    61.2   6.45
62.0   6.70    56.7   8.03    59.1   8.15    67.7  10.21
61.7  11.40    55.0  11.51    53.6   9.88    56.4   5.81
54.0  19.52    59.3   9.05    57.8  10.92    51.4  10.78
59.0  11.09    51.1   8.75    60.0   8.17    62.8   8.43

6. Grades in elementary statistics and in mathematics of finance for the same
students are given below. What do you conclude is the relation between a 
student's grades in these two subjects? Give evidence upon which your con- 
clusion is based. 


X(statistics): 94 83 91 98 80 82 61 81 58 90 85 75 75 70 92 62 
Y(finance): 89 90 91 97 85 87 41 88 60 85 86 83 87 72 97 64 


7. The following are weights of the larvae of honey bees at different ages: 


Set A. X(days): 1 2 3 4 5 6 
Y (milligrams): 2.0 4.3 23.1 93.1 148.7 295.5 
Set B. X(days): 1 2 3 4 5 6 


Y (log milligrams): 0.30 0.63 1.36 1.96 2.17 2.47 


Construct scatter diagrams for each set separately and decide for which, if
either, the assumption of a linear relation between X and Y appears to be
justified. If either set produces a satisfactorily linear trend, estimate the
slope of the best-fitting freehand line and state what information it
provides. Should you make some allowance for the fact that you used a
freehand line in a position with which others might disagree?


8. Make a scatter diagram for the following pairs of observations. Draw in
a freehand line which appears to you to be the best-fitting straight trend
line, and derive from this line an estimate of the Y which should
correspond, on the average, to X = 0, 4.5, and 7.5:

X:  1  2  3  4  5  6  7  8  9 10
Y: 21 20 17 15 14 14 12  9  6  5

Ans. About 23.3, 15.1, and 9.7, respectively.


7.2 A METHOD FOR DETERMINING LINEAR 
REGRESSION (OR TREND) LINES 


It is quite customary to use the term regression line to describe 
the line chosen to represent the relationship between two variables 
when this decision is based on sample points, as in a scatter diagram. 
The origin of the term regression probably lies in genetic studies of 
the tendency for offspring of parents who are well above, or below, 
the group average to go back, or "regress," toward that group aver- 
age. The term trend line will be used interchangeably with regres- 
sion line, even though the former is frequently associated with dis- 
cussions of time series. 

Previously, in this chapter, freehand lines have been used to depict
the average change of one variable with respect to another. Such
a procedure, however, obviously is somewhat subjective because it
depends quite a bit upon personal opinion. One of the chief purposes
of numerical measurement and statistical analysis of such
measurements is to free decisions based on relatively precise numbers from
distortions which might result from the exercise of personal tastes
and opinions. It is for this reason that it is desirable to be able to
describe such a line by a method which will produce the same result
no matter who uses it.

The types of relationships between two kinds of numerical
measurements which were discussed in the preceding section are
illustrative of sampling experiences involving errors of observation and
measurement. The dots of the scatter diagram usually fail to fall
exactly on any simple curve for one of two reasons: (a) Sampling
errors or chance variations cause the values of Y, say, to be partially
inaccurate. (b) There are real variations from the general trend of
Y with X which, however, are of minor importance compared to the
general trend and should be smoothed out in order that the general
trend may be studied more effectively. The data of Table 7.21 and
the corresponding scatter diagram of Figure 7.21 help to illustrate
these points. The data in the table are considered to be a population
of pairs of observations, that is, a bivariate population. For
convenience these data have been grouped by 16-week weights (X) to
the nearest pound in the scatter diagram of Figure 7.21.

The bivariate population of Table 7.21 possesses several
characteristics which are of statistical interest and importance. These
features are exhibited by Figure 7.21, from which it is learned: (a)
There is a general upward trend of the 28-week weight, with
increasing 16-week weight of the same bird. (b) Within each
16-week-weight class there is a frequency distribution of 28-week weights,
and this distribution is relatively symmetrical about the mean
28-week weight for the class. (c) The means of the six 16-week-weight
classes lie perfectly on a straight line with a slope of 1/2. Thus the
true linear regression line passes through the points representing
the true average Y's for the given X's. The slope of this true trend
line is denoted by the Greek letter β (beta).

In a study based on samples the β is unknown, as is the exact
location of the true linear trend line, and only the n pairs of sample
measurements are available as a basis for making decisions about the
linear trend line. For example, a random sample of 30 pairs (X, Y)
was drawn from the bivariate population of Table 7.21, with the

TABLE 7.21

PAIRS OF OBSERVATIONS OF THE 16-WEEK WEIGHTS AND CORRESPONDING
28-WEEK WEIGHTS OF TURKEYS RAISED ON THE KANSAS STATE COLLEGE
POULTRY FARM

(X is the 16-week weight in pounds; Y is the 28-week weight.)
X Y X Y X Y X Y X Y


49 133 5.4 146 5.2 134 5.4 140 46 13.0 
47 12.7 48 12.9 49 124 5.0 12.6 5.1 125 
5.2 131 5.3 13.5 54 13.6 47 143 49 13.8 
5.1 13.7 5.2 13.2 46 140 47 143 48 141 
49 13.6 5.0 14.0 5. 13.6 5.22 13.5 5.3 13.5 
6.2 13.5 65 148 64 143 65 135 64 144 
6.5 13.0 5.5 13.5 61 153 5.5 13.0 6.0 143 


67 148 68 150 73 128 73 143 T3 136 
70 135 70 150 68 139 74 141 6.9 150 
72 133 73 153 66 14.0 66 148 7.4 141 
73 140 6.9 140 7.0 141 68 142 74 156 
73 163 7.0 15.8 66 147 68 129 70 134 
72 148 74 14.9 75 147 76 147 TY 187 
7.8 147 79 148 80 157 $1 151 82 154 
83 157 84 157 8.5 158 75 145 $2 151 
83 155 78 153 7.9 16.6 83 171 $3 153 
75 154 7.6 147 &5 145 85 150 75 148 
84 156 75 15.6 78 161 7.8 155 77 159 
7.8 160 $3 134 82 149 81 144 82 140 
83 14.5 82 133 75 160 80 150 75 155 
I; 183 Te ma — 26 154 go ly — 4$ Ye 
80 104 81 167 85 151 7.6 r 


6 146 j 3 
78 143 78 152 78 170 79 153 23 ea
7.5 13.8 8.5 14.5 7.9 15.9 81 144 8.3 15.0 
8.2 156 84 14.8 7.8 13.7 8.2 15.0 7.6 13.3 
7.6 14.2 8.1 14.1 8.0 13.5 8.1 153 8.2 15.3 
7.5 15.5 7.9 15.8 7.7 163 78 17.0 78 13.5 
8.0 15.6 8.3 16.2 84 148 84 144 78 132 
8.5 16.9 8.0 13.7 8.2 15.5 8.3 15.0 8.2 15.7 
8&0 13.6 8.0 143 84 15.0 85 15.1 8&2 154 
8.0 14.9 8.1 15.0 8.5 15.2 78 14.0 7.6 15.2 
76 141 85 15.5 7.7 143 75 14.6 81 143 
81 144 8.0 13.8 82 13.8 76 14.9 82 15.3 
8.1 153 75 14.1 8.5 153 81 154 7.5 15.0 
8&0 150 78 140 8.5 16.0 7.9 15.5 8.1 16.5 
85 15.6 7.5 13.9 7.5 143 7.5 140 7.6 14.0 
7.8 140 88 14.2 9.0 15.5 8.9 15.0 88 144 
9.0 15.5 9.5 17.2 9.4 16.0 9.0 158 91 16.2 
8.7 163 94 16.9 9.4 16.0 9.0 15.1 8.7 16.3 
9.2 15.7 91 153 92 16.4 9.4 16.5 8.6 152 
8.7 15.7 9.0 15.9 8.9 14.6 9.0 16.3 9.0 16.1 
8.6 16.4 92 143 8.6 14.1 9.2 15.1 8.6 13.9 
8.6 15.0 87 162 9.0 16.1 91 153 9.2 15.1 
9.2 140 9.0 144 9.0 15.5 8.6 144 8.6 15.0 
87 153 9.1 16.5 8.8 15.5 &8 145 91 153 
8.6 15.9 8.6 151 9.2 15.8 9.2 15.5 8.6 14.6 
8.6 153 9.2 153 9.0 15.6 9.1 154 87 170 
8.8 14.8 9.0 15.5 9.2 15.8 9.0 15.5 9.0 15.5 
8.6 14.1 8.7 15.6 9.1 16.2 8.8 15.1 8.6 14.9 
8.7 14.9 87 14.7 88 148 89 148 9.0 14.9 
9.1 14.6 9.2 15.8 93 15.8 94 15.9 8.6 16.0 
8&7 16.0 8.8 16.1 8.9 16.6 9.0 16.6 9.1 16.7 
9.2 16.8 93 16.9 94 17.0 8.8 172 93 17.6 
8.6 14.5 87 145 88 14.2 8.9 15.0 9.0 15.0 
91 14.8 92 144 8.9 15.1 8.6 174 10.1 16.7 
98 154 9.5 148 9.6 15.0 98 15.2 9.8 16.0 
9.0 15.5 98 152 95 170 95 178 104 167 
9.5 16.7 10.5 17.1 98 16.4 95 155 9.6 16.0 
9.7 16.2 9.8 16.2 99 158 10.0 164 101 165 
9.5 15.7 10.2 16.8 10.3 16.9 10.4 15.9 9.5 14.8 
9.7 15.6 9.9 145 101 158 103 149 103 153 
9.06 14.4 105 17.5

results shown specifically in Table 7.22 and graphically as X's in
Figure 7.21. The decision regarding linearity of trend and the
estimation of any desired features of the true trend line (such as slope)
must be accomplished from the information contained in the sample.

If the trend of one variable with another is linear, the relationship 
between the two kinds of measurements, X and Y, is of the form 
Y = A + BX, in which, for illustration, Y stands for the 28-week 
weight of a certain breed of turkey and X = the 16-week weight of 


TABLE 7.22
A RANDOM SAMPLE OF n = 30 PAIRS (X, Y) FROM TABLE 7.21
X Y X Y X Y X Y


4.8 12.9 7.2 13.3 7.9 15.9 8.8 14.4 
6.5 14.8 7.0 15.0 8.0 14.9 9.2 15.7 
6.4 14.4 6.9 14.7 8.1 15.0 9.0 15.9 
5.5 13.5 6.8 15.1 8.0 13.8 8.6 14.4 
6.1 13.4 7.2 13.3 7.6 14.9 9.1 15.3 
6.0 13.6 7.5 14.3 7.5 14.1 9.0 16.6 
7.4 14.3 8.5 15.1 7.5 13.9 9.2 16.8 
7.0 14.5 8.5 15.2 


the same turkey. If all the observed pairs of measurements (X, Y)
in Table 7.21 satisfy a linear equation perfectly, all the points of
Figure 7.21 will lie exactly on the same straight line; and the
relationship between the two variables will be perfectly linear.
Moreover, the equation of the line can be determined from the coordinates
of any two distinct points. Such obviously is not the case in Figure
7.21 because errors of measurement and uncontrollable fluctuations
in the 28-week weights of turkeys which weighed the same at 16
weeks of age must be averaged out before the trend appears to be
linear. By contrast with situations met in elementary algebra, where
the equation is given and all appropriate points lie on the line, the
present situation starts with the points given from sample
observations, and the problem is to determine which straight line best fits
these observations, and, it is hoped, best estimates the true linear
regression line.

Assuming that a set of observations really does follow a linear
trend quite well, how can a specific equation of the form Y = A + BX
be determined and also defended as the best straight line to be
employed under the circumstances? The answer to this question
depends upon the interpretation of the word "best." One interpretation,
and the one most frequently accepted, can be illustrated by
means of the line drawn into Figure 7.22. Some points lie above
this line, some lie below, at distances whose magnitudes can be
measured by the lengths of the vertical lines which could be drawn
connecting the points of the scatter diagram to the regression line.
In a useful sense, the goodness of fit achieved by any line drawn
among the points to depict their trend should be measured somehow
in terms of the amounts by which the proposed line misses the points
of the scatter diagram.


[Figure 7.22: scatter diagram of 28-week weight (Y, 12 to 18 pounds)
against 16-week weight (X, 5 to 10 pounds).]

Figure 7.22. A random sample of 30 pairs of observations from the
population of Figure 7.21 and Table 7.21. One line is a freehand line
indicating the trend as it might appear to the eye; the other is the line
determined by the method of least squares.


It will facilitate the discussion to introduce some symbolism before
presenting the specific methods to be used in the determination of the
equation of the regression line. For a given value $X_i$ of the
measurement X, let the corresponding value of Y be called $Y_i$ if it was
observed with $X_i$ when the sample was taken. It will be designated as
$\hat{Y}_i$ if it is calculated from the equation of the regression line.
Also, let the general linear equation relating $\hat{Y}_i$ and $X_i$ be
written in the form

(7.21)    $\hat{Y}_i = a + b(X_i - \bar{x})$,*

where a and b are the two constants which must be determined in
order to have a specific trend line for a particular scatter diagram.

* This form and the notation do not agree entirely with some other textbooks,
but they are used here for convenience. The $(X_i - \bar{x})$ is $x_i$, so
that the subsequent formulas and discussions come quite simply from this form
of the equation for $\hat{Y}$.
Some authors use other letters than a and b; and several others use b as herein

As stated above, the a and b will be calculated in terms of the
collective amount by which a line misses the points of the scatter
diagram. The a and the b also are considered estimates of the
population parameters α and β in the true linear regression equation,

(7.22)    $Y = \alpha + \beta(X - \mu)$,

where μ = the true mean of the X's.

For reasons given earlier for using $\sum(X - \bar{x})^2$ to measure
variation about the mean instead of either $\sum(X - \bar{x})$ or
$\sum|X - \bar{x}|$, it is found advisable to use $\sum(Y - \hat{Y})^2$
to measure the scatter of the Y's about the trend line. Therefore, the
best-fitting straight line has been chosen as that one for which the
$\sum(Y - \hat{Y})^2$ has the least possible size. This action makes the
standard deviation about the trend line as small as possible. The
mathematical process of achieving this goal produces formulas from which
the a and the b can be computed. When these values are substituted into
formula 7.21 a specific equation of a regression line is obtained. This
line will have the property that the standard deviation about it is as
small as it is possible to make it for any straight line. In other words,
the variability of the Y's has been reduced as much as it can be in
consideration of their linear trend with X.
The formulas for a and b are as follows:

(7.23)    $a = \bar{y}$, and $b = \dfrac{\sum (X - \bar{x})(Y - \bar{y})}{\sum (X - \bar{x})^2} = \dfrac{\sum(xy)}{\sum(x^2)}$,

where $\bar{y}$ = mean of the Y's in the sample and y = the deviation of a
Y from the mean, $\bar{y}$. The $\sum(xy)$, which the student has not met
before in this book, is $(X_1 - \bar{x})(Y_1 - \bar{y}) + (X_2 - \bar{x})(Y_2 - \bar{y}) + \cdots + (X_n - \bar{x})(Y_n - \bar{y}) = x_1y_1 + x_2y_2 + \cdots + x_ny_n$.

For the data of Table 7.22, $a = \sum(Y)/n = 439.0/30 = 14.63$,
$b = \sum(xy)/\sum(x^2) = 23.0200/37.912 = 0.6072$, and
$\bar{x} = \sum(X)/n = 226.8/30 = 7.56$. Therefore, since
$\hat{Y} = \bar{y} + b(X - \bar{x}) = bX + (\bar{y} - b\bar{x})$,

(7.24)    $\hat{Y} = 0.6072X + 10.04$.
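
The arithmetic behind formulas 7.23 and 7.24 can be retraced with a short
program. The following is only a sketch in Python: the data are the 30
pairs of Table 7.22, while the function name `least_squares` and the
variable names are illustrative choices, not part of the text.

```python
# Table 7.22: X = 16-week weight, Y = 28-week weight (pounds).
pairs = [
    (4.8, 12.9), (6.5, 14.8), (6.4, 14.4), (5.5, 13.5), (6.1, 13.4),
    (6.0, 13.6), (7.4, 14.3), (7.0, 14.5), (7.2, 13.3), (7.0, 15.0),
    (6.9, 14.7), (6.8, 15.1), (7.2, 13.3), (7.5, 14.3), (8.5, 15.1),
    (8.5, 15.2), (7.9, 15.9), (8.0, 14.9), (8.1, 15.0), (8.0, 13.8),
    (7.6, 14.9), (7.5, 14.1), (7.5, 13.9), (8.8, 14.4), (9.2, 15.7),
    (9.0, 15.9), (8.6, 14.4), (9.1, 15.3), (9.0, 16.6), (9.2, 16.8),
]


def least_squares(pairs):
    """Return (x_bar, a, b) for Y-hat = a + b(X - x_bar), as in formula 7.23."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    # Sums of products and squares of deviations from the means.
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sum_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return x_bar, y_bar, sum_xy / sum_xx


x_bar, a, b = least_squares(pairs)
intercept = a - b * x_bar  # constant term when the line is written as bX + c

print(f"x_bar = {x_bar:.2f}, a = {a:.2f}, b = {b:.4f}")
print(f"Y-hat = {b:.4f} X + {intercept:.2f}")
```

Writing the line as a + b(X - x_bar) keeps a equal to the mean of the Y's;
the constant 10.04 of formula 7.24 appears only after the terms are
collected.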


Students in a statistics course are in an unusually fortunate position
because when they take samples from laboratory populations they can
see readily how well, or poorly, certain features of their samples agree
with the corresponding features of the populations being sampled.
For example, the slope of the above sample estimate of the linear
regression line is calculated to be b = 0.6072, whereas the true slope is
known to be β = 0.5000. In actual practice, only the b is known, and
it is necessary to measure its reliability as an estimate of β. This will
be done later when the necessary techniques have been discussed; but
it can be stated here that if the sample has been taken with the X's
fixed, as suggested in Figure 7.21, so that there is no sampling error
in X or in $\bar{x}$, the b as defined is an unbiased estimate of the
parameter β.

* Experience shows that beginners in this field tend to think that
$\sum(xy) = \sum(x) \cdot \sum(y)$. If the reader will recall that
$\sum(x) = \sum(X - \bar{x}) = 0$ for any set of data, and likewise for
$\sum(y)$, it becomes apparent that $\sum(xy)$ is not the same as
$\sum(x) \cdot \sum(y)$ or it always would be zero. This obviously is untrue.
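
The statement that b is unbiased when the X's are fixed can be made
plausible by a sampling experiment. The sketch below is an illustration
under assumptions that are not in the text: the six fixed X values with
five birds each, the intercept 11.0, the normal errors with standard
deviation 0.8, and the 2000 repetitions are all hypothetical choices. Only
the true slope β = 0.5 is taken from the discussion above.

```python
import random


def slope(xs, ys):
    """Least-squares slope b = sum(xy) / sum(x^2), using deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    return sum_xy / sum_xx


random.seed(1)       # fixed seed so the experiment can be repeated exactly
BETA = 0.5           # true slope, as stated in the text
INTERCEPT = 11.0     # hypothetical intercept of the true line
ERROR_SD = 0.8       # hypothetical scatter of Y about the true line

# Fixed X's: five observations at each of six 16-week weights (hypothetical design).
xs = [x for x in (5, 6, 7, 8, 9, 10) for _ in range(5)]

slopes = []
for _ in range(2000):
    ys = [INTERCEPT + BETA * x + random.gauss(0.0, ERROR_SD) for x in xs]
    slopes.append(slope(xs, ys))

mean_b = sum(slopes) / len(slopes)
print(f"smallest b = {min(slopes):.3f}, largest b = {max(slopes):.3f}")
print(f"average of all b's = {mean_b:.3f}")
```

Individual samples give slopes on both sides of β, but their average over
many samples settles close to β, which is what unbiasedness means.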

The value $\hat{Y}$ which is obtained from formula 7.24 by substituting a
particular value for X is described as the estimated average Y for that X.
For example, if X is taken as 5, $\hat{Y} = 0.6072(5) + 10.04 = 13.1$,
approximately. By reference to Figure 7.21 we learn that this estimate
is somewhat low because the true average Y for turkeys weighing 5
pounds at 16 weeks of age is 13.5 pounds. If X is taken = 8, $\hat{Y}$ =
14.9 pounds, which is nearer to the true average Y of 15 pounds than
was obtained when X = 5 and the true Y was 13.5 pounds. It will
be seen in a later discussion that greater accuracy in estimating the
true average Y is to be expected for X's near the mean X. There
often are more sample data near the mean; but also errors in estimating
the β will cause the ends of the trend line to be swung farther from the
true position than is the middle of the line. In the above example
the slope was b = 0.6072 instead of β = 0.5000; hence the line
determined from the sample is too steep and therefore too low at the
left-hand end. This appears to be the major reason why the estimate of
the true average Y for X = 5 was too small. Of course, the general
height of the sample line must be in error to some extent, and this
also contributes to the inaccuracy of any estimate made from the
sample trend line.
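
The two sources of error just described can be separated numerically. The
sketch below assumes that the true line is Y = 11 + 0.5X; that equation is
not printed in the text but is implied by the true slope 0.5 and the true
averages 13.5 (at X = 5) and 15 (at X = 8). The function names are
illustrative only.

```python
X_BAR = 7.56  # mean of the sample X's (Table 7.22)


def sample_line(x):
    """Estimated average Y from formula 7.24."""
    return 0.6072 * x + 10.04


def true_line(x):
    """Assumed true regression line, implied by the true averages quoted in the text."""
    return 11.0 + 0.5 * x


# Error in the general height of the sample line, measured at the mean X.
height_error = sample_line(X_BAR) - true_line(X_BAR)
# Error in the slope, which swings the ends of the line away from the truth.
slope_error = 0.6072 - 0.5


def total_error(x):
    """Height error plus the slope error multiplied by the distance from the mean X."""
    return height_error + slope_error * (x - X_BAR)


for x in (5.0, 8.0):
    print(f"X = {x}: estimate {sample_line(x):.1f}, true {true_line(x):.1f}, "
          f"error {total_error(x):+.2f}")
```

At X = 5 the slope error accounts for the larger share of the
underestimate, which agrees with the explanation given above.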

The method described for obtaining the straight line which fits a
linear trend best is called the method of least squares because it
makes the sum of squares of the vertical deviations of the points of
the scatter diagram from the regression line the least it can be made
for any straight line. Table 7.23 has been prepared to illustrate
specifically the meaning of this minimization. Columns 6, 7, and
8 were obtained from the equation given over the right-hand side
of the table. This equation represents a straight line which appears
to the eye to fit the trend of the scatter diagram about as well as
the line obtained by the method of least squares, as can be verified
from Figure 7.22, which shows both lines.
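
The comparison between the two lines can be checked directly from the
sample. The following sketch uses the data of Table 7.22 and the two
equations quoted in the text; the helper name `sum_sq_dev` is an
illustrative choice. Because no intermediate rounding is done here, the
totals may differ slightly in the last digits from those of a hand
computation carried to two decimals.

```python
# Table 7.22: X = 16-week weight, Y = 28-week weight (pounds).
pairs = [
    (4.8, 12.9), (6.5, 14.8), (6.4, 14.4), (5.5, 13.5), (6.1, 13.4),
    (6.0, 13.6), (7.4, 14.3), (7.0, 14.5), (7.2, 13.3), (7.0, 15.0),
    (6.9, 14.7), (6.8, 15.1), (7.2, 13.3), (7.5, 14.3), (8.5, 15.1),
    (8.5, 15.2), (7.9, 15.9), (8.0, 14.9), (8.1, 15.0), (8.0, 13.8),
    (7.6, 14.9), (7.5, 14.1), (7.5, 13.9), (8.8, 14.4), (9.2, 15.7),
    (9.0, 15.9), (8.6, 14.4), (9.1, 15.3), (9.0, 16.6), (9.2, 16.8),
]


def sum_sq_dev(pairs, slope, intercept):
    """Sum of squared vertical deviations of the points from the given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)


least_squares_total = sum_sq_dev(pairs, 0.6072, 10.04)  # formula 7.24
freehand_total = sum_sq_dev(pairs, 0.88, 7.92)          # freehand line of Table 7.23

print(f"least-squares line: {least_squares_total:.2f}")
print(f"freehand line:      {freehand_total:.2f}")
```

Any other straight line may be substituted for the freehand one; its total
should not fall below that of the least-squares line.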

TABLE 7.23

ILLUSTRATION OF SOME FEATURES OF THE METHOD OF LEAST SQUARES
USING DATA OF TABLE 7.22

Columns (3) to (5): method of least squares, $\hat{Y} = 0.6072X + 10.04$.
Columns (6) to (8): freehand straight line, $\hat{Y}_1 = 0.88X + 7.92$.
Column (4) is $Y - \hat{Y}$, column (5) is $(Y - \hat{Y})^2$, column (7) is
$Y - \hat{Y}_1$, and column (8) is $(Y - \hat{Y}_1)^2$.

  X      Y     Y-hat   dev.   dev. sq.   Y-hat1   dev.   dev. sq.
 (1)    (2)     (3)     (4)     (5)       (6)      (7)     (8)
 4.8   12.9   12.95   -0.05   0.0025    12.14    +0.76   0.5776
 6.5   14.8   13.99   +0.81   0.6561    13.64    +1.16   1.3456
 6.4   14.4   13.93   +0.47   0.2209    13.55    +0.85   0.7225
 5.5   13.5   13.38   +0.12   0.0144    12.76    +0.74   0.5476
 6.1   13.4   13.74   -0.34   0.1156    13.29    +0.11   0.0121
 6.0   13.6   13.68   -0.08   0.0064    13.20    +0.40   0.1600
 7.4   14.3   14.53   -0.23   0.0529    14.43    -0.13   0.0169
 7.0   14.5   14.29   +0.21   0.0441    14.08    +0.42   0.1764
 7.2   13.3   14.41   -1.11   1.2321    14.26    -0.96   0.9216
 7.0   15.0   14.29   +0.71   0.5041    14.08    +0.92   0.8464
 6.9   14.7   14.23   +0.47   0.2209    13.99    +0.71   0.5041
 6.8   15.1   14.17   +0.93   0.8649    13.90    +1.20   1.4400
 7.2   13.3   14.41   -1.11   1.2321    14.26    -0.96   0.9216
 7.5   14.3   14.59   -0.29   0.0841    14.52    -0.22   0.0484
 8.5   15.1   15.20   -0.19   0.0361    15.40    -0.30   0.0900
 8.5   15.2   15.20    0      0         15.40    -0.20   0.0400
 7.9   15.9   14.84   +1.06   1.1236    14.87    +1.03   1.0609
 8.0   14.9   14.90    0      0         14.96    -0.06   0.0036
 8.1   15.0   14.92   +0.08   0.0064    15.05    -0.05   0.0025
 8.0   13.8   14.90   -1.10   1.2100    14.96    -1.16   1.3456
 7.6   14.9   14.65   +0.25   0.0625    14.61    +0.29   0.0841
 7.5   14.1   14.59   -0.49   0.2401    14.52    -0.42   0.1764
 7.5   13.9   14.59   -0.69   0.4761    14.52    -0.62   0.3844
 8.8   14.4   15.38   -0.98   0.9604    15.66    -1.26   1.5876
 9.2   15.7   15.63   +0.07   0.0049    16.02    -0.32   0.1024
 9.0   15.9   15.50   +0.40   0.1600    15.84    +0.06   0.0036
 8.6   14.4   15.26   -0.86   0.7396    15.49    -1.09   1.1881
 9.1   15.3   15.57   -0.27   0.0729    15.93    -0.63   0.3969
 9.0   16.6   15.50   +1.10   1.2100    15.84    +0.76   0.5776
 9.2   16.8   15.63   +1.17   1.3689    16.02    +0.78   0.6084
Sums                          12.9226                    15.8926

It should be noted that the total of the fifth column of Table 7.23 is
less than that of the eighth column. This will always be true no
matter which straight line is used to obtain $\hat{Y}_1$ as long as the
equation is not obtained by the method of least squares, and as long as
sufficient accuracy is kept in the calculations to pick up small
differences.

This is the basis for the statement that the method of least squares
makes the standard deviation of the Y's from the trend line as small
as possible for any straight line, which is a strong argument for the
use of this line in practice.

PROBLEMS


1. Obtain the linear equation whose graph fits the points of Figure 7.11B best 
in the sense of the method of least squares. Graph the line on the scatter diagram 
and indicate graphically those deviations whose sum of squares is the least possible 
for any straight line. 

2. Do as in problem 1, for Figure 7.11A. Also compute $\sum(Y - \hat{Y})^2$.

Ans. $\hat{Y} = 2.16X + 4.54$; $\sum(Y - \hat{Y})^2 = 9.77$.

3. By what average amount would you expect Y to increase for a unit increase
in X if the data corresponding to Figure 7.11A constitute a representative
sample of some two-variable population?

4. Compare the $\sum(Y - \hat{Y})^2$ and $\sum(Y - \bar{y})^2$ for the data
of Figures 7.11B and G. What conclusions can you draw?

5. Use the method of least squares to estimate for the data of Figure 7.11B the
average value of Y for X = 1.5, 2.5, 3.5, and 4.5, respectively.

6. Make up two sets of 10 pairs of observations each and such that b is about 2
in one set and about -3 in the other.

7. Write down the equation of a trend line with slope = 5 and for which
$\hat{Y}$ = 10 when X = 4. Graph this line, and then construct a scatter
diagram which fits the trend and has $\sum(Y - \hat{Y})^2 = 50$.

8. Do as in problem 7, with slope = -3 and everything else the same.

9. Assign row and column numbers to the data in Table 7.21. Then draw
two random samples of 30 pairs each, as in Table 7.22, and obtain the
least-squares regression line from each sample. Plot these lines on their
corresponding scatter diagrams and discuss their differences. (Round off
each X to the nearest pound before doing your computations.)

10. "Cull" the flock of Table 7.21 at 16 weeks of age by eliminating all
turkeys which weighed under 6 pounds at that time, then do as in problem 9.
Would β still be 0.5 for this population?


7.3 MEASUREMENT OF THE VARIATION ABOUT A 
LINEAR TREND LINE DETERMINED BY THE 
METHOD OF LEAST SQUARES 


The variation of a normally distributed variable, Y, should be measured
by the standard deviation about the mean, and estimated from
$s_Y = \sqrt{\sum(Y_i - \bar{y})^2/(n - 1)}$ because $s_Y^2$ is an unbiased
and highly efficient sampling estimate of $\sigma_Y^2$. The variation
measured by $s_Y$ is then considered to be sampling variation. However, if
for each $Y_i$ there is an associated measurement, $X_i$, and if the X's and
Y's tend to be linearly related, not all the apparent variability among the
$Y_i$ should be assigned to mere sampling errors. Part of it can be
accounted for in terms of the varying $X_i$ associated with the $Y_i$. For
example, if Y tends to increase about 5 units for each unit increase in
the magnitude of X the Y associated with X = 10 is expected to be
about 15 units greater than the Y for X = 7; hence some of any
observed difference between the Y for X = 7 and the Y for X = 10 can
be accounted for and need not be considered as sampling error.

Graphically the preceding remarks mean that if Y and X can be
considered to be linearly related the standard deviation of the $Y_i$
should be calculated from the trend line rather than from the
horizontal line: $\hat{Y} = \bar{y}$. This means that the quantity
$\sum(Y - \hat{Y})^2$ should be employed in this calculation instead of
$\sum(Y - \bar{y})^2$. However, the divisor in this calculation will not be
(n - 1) as it is for $s_Y$, above.

The divisor needed in the computation of the standard deviation
about the trend line is (n - 2). The reason for this cannot be given
conclusively without mathematical analysis which is not appropriate
to this book; but it can be rationalized in the following manner.
Suppose that a sample of 5 observations on X and Y simultaneously
were as follows:

X: 1 2 3 4 5
Y: 5 4 6 8 7

It is readily determined that $\hat{Y} = 0.80X + 3.6$; hence the following
table can be set up for purposes of illustration:

X:                      1      2      3      4      5
Y:                      5      4      6      8      7
$\hat{Y}$:             4.4    5.2    6.0    $\hat{Y}_4$    $\hat{Y}_5$
$(Y_i - \hat{Y}_i)$:   0.6   -1.2     0      ?      ?

What are the deviations from the trend line for X = 4 and X = 5,
respectively? The fact that $\sum(Y_i - \hat{Y}_i) = 0$ will be found to
account for one of these deviations. The fact that b = 0.80 will allow the
determination of the second unknown deviation.

Let the unknown deviations $Y_4 - \hat{Y}_4$ and $Y_5 - \hat{Y}_5$
corresponding to X = 4 and X = 5 be denoted by v and w, respectively. It
follows that

$\sum(Y_i - \hat{Y}_i) = 0.6 + (-1.2) + 0 + v + w = 0$,

which reduces easily to

(7.31)    $v + w = 0.6$.


The slope of a straight line can be computed by determining the
amount by which Y changes for any chosen change in X, and by taking
the ratio of the former to the latter. For example, if it is determined
by measurement on the graph or by substitution into a mathematical
formula that for the interval from X = 1 to X = 5 the height of the
straight line above the horizontal axis increases from 10 to 30, the
slope of this line is measured by 20/4 = 5. Hence, in the situation of
the preceding paragraph the slope is given by
$(\hat{Y}_5 - \hat{Y}_4)/(X_5 - X_4)$. But $(X_5 - X_4) = 1$, and the slope
is known to be 0.80; hence $(\hat{Y}_5 - \hat{Y}_4) \div 1 = 0.80$. In order
to transform this equation into one involving v and w consider the
following two equations:

$Y_5 - Y_4 = Y_5 - Y_4$
$\hat{Y}_5 - \hat{Y}_4 = 0.80$.

When the left and right members of the second equation are
subtracted from the corresponding members of the first equation, it is
found that

$(Y_5 - \hat{Y}_5) - (Y_4 - \hat{Y}_4) = Y_5 - Y_4 - 0.80$;

but $Y_5 - \hat{Y}_5 = w$, $Y_4 - \hat{Y}_4 = v$, and $Y_5 - Y_4 = -1$;
therefore,

(7.32)    $v - w = 1.8$.
When equations 7.31 and 7.32 are solved simultaneously it is found
that v = 1.2 and w = -0.6. Hence, two of the deviations from the
trend line can be calculated from the size of b and from the fact that
$\sum(Y - \hat{Y}) = 0$. Although there are five actual deviations from the
linear trend line, only three (any three) of them actually should be
considered chance deviations from the regression line determined from
the sample. Hence, in the present problem, n - 2 = 5 - 2 = 3 will
be used as the divisor of $\sum(Y_i - \hat{Y}_i)^2$ in the computation of
the standard deviation about the linear trend line.
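
The argument above can be verified numerically for the five-point sample.
The sketch below fits the line, then recovers the last two deviations from
the first three by means of equations 7.31 and 7.32; the variable names are
illustrative.

```python
X = [1, 2, 3, 4, 5]
Y = [5, 4, 6, 8, 7]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))
intercept = y_bar - b * x_bar

# All five deviations from the fitted line.
deviations = [y - (intercept + b * x) for x, y in zip(X, Y)]

# Equation 7.31: the deviations sum to zero, so v + w = -(first three).
v_plus_w = -sum(deviations[:3])
# Equation 7.32: v - w = (Y4 - Y5) + b * (X5 - X4).
v_minus_w = (Y[3] - Y[4]) + b * (X[4] - X[3])

v = (v_plus_w + v_minus_w) / 2
w = (v_plus_w - v_minus_w) / 2

print(f"line: Y-hat = {b:.2f} X + {intercept:.1f}")
print(f"v = {v:.1f}, w = {w:.1f}")
```

The values of v and w obtained this way agree with the fourth and fifth
deviations computed directly, so only three of the five deviations are free
to vary once the line has been fitted.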

This divisor, n — 2, is generally called the number of degrees of free- 
dom for the estimated standard deviation about the linear trend line, 
just as the number, n — 1, is the number of degrees of freedom for the 
estimated standard deviation, sy, about the mean. 

With the above discussion as a background, the formula for the
estimated standard deviation of the $Y_i$ about the trend line becomes

(7.33)    $s_{y \cdot x} = \sqrt{\sum(Y_i - \hat{Y}_i)^2/(n - 2)}$,

wherein the symbol, $s_{y \cdot x}$, is read "s sub y dot x."

For the data used as illustration in this section,
$\sum(Y_i - \hat{Y}_i)^2 = 3.60$, n - 2 = 3; hence
$s_{y \cdot x} = \sqrt{3.60/3} = 1.10$. This is a measure of the
variation among the Y-measurements which remains unexplained even
after the linear trend with X has been taken into account. When the
trend with X is ignored, $s_Y = \sqrt{10/4} = 1.58$; so $s_{y \cdot x}$ is
0.48 of a unit smaller than $s_Y$. In other words, the variability of the
$Y_i$ (as measured by the standard deviation) has been reduced
100(0.48)/1.58 = 30.4 per cent by taking the linear relation between the
two measurements into account statistically. Such success in accounting
for part of the variation among the measurements, $Y_i$, clearly is
important in statistical analyses because the only occasion for such
analyses arises as a result of variability among numerical measurements.
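
The two standard deviations compared above can be computed side by side.
The following sketch uses the five-point sample of this section; the
variable names are illustrative. Since the program does not round the
standard deviations before taking the percentage, its result may differ
slightly from the 30.4 per cent obtained above from rounded figures.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [5, 4, 6, 8, 7]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))

# Sum of squares about the mean (trend ignored) and about the trend line.
ss_about_mean = sum((y - y_bar) ** 2 for y in Y)
ss_about_line = sum((y - (y_bar + b * (x - x_bar))) ** 2 for x, y in zip(X, Y))

s_y = math.sqrt(ss_about_mean / (n - 1))   # n - 1 degrees of freedom
s_yx = math.sqrt(ss_about_line / (n - 2))  # n - 2 degrees of freedom, formula 7.33

reduction_pct = 100 * (s_y - s_yx) / s_y
print(f"s_y = {s_y:.2f}, s_y.x = {s_yx:.2f}, reduction = {reduction_pct:.1f} per cent")
```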

The standard deviation about the trend line, $s_{y \cdot x}$, also is
specifically useful in certain applications of linear trend analysis, two
of which will be considered. The regression coefficient, b, estimates the
average change in the Y-measurement for each unit increase in the
X-measurement. Its accuracy as such a measure is of interest, and its
accuracy is measured by its standard deviation. The standard deviation of b
is shown in more advanced statistics courses to be

(7.34)    $s_b = \dfrac{s_{y \cdot x}}{\sqrt{\sum(x^2)}}$.

For the data of Table 7.22: $\sum(Y - \hat{Y})^2 = 12.9226$,
$\sum(x^2) = 37.9120$, n = 30, and hence
$s_{y \cdot x} = \sqrt{12.9226/28} = 0.679$. Therefore,
$s_b = 0.679/\sqrt{37.9120} = 0.110$, approximately.
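
As a check on these figures, formulas 7.33 and 7.34 can be applied to the
sample of Table 7.22 without intermediate rounding. This is a sketch only;
the variable names are illustrative.

```python
import math

# Table 7.22: X = 16-week weight, Y = 28-week weight (pounds).
pairs = [
    (4.8, 12.9), (6.5, 14.8), (6.4, 14.4), (5.5, 13.5), (6.1, 13.4),
    (6.0, 13.6), (7.4, 14.3), (7.0, 14.5), (7.2, 13.3), (7.0, 15.0),
    (6.9, 14.7), (6.8, 15.1), (7.2, 13.3), (7.5, 14.3), (8.5, 15.1),
    (8.5, 15.2), (7.9, 15.9), (8.0, 14.9), (8.1, 15.0), (8.0, 13.8),
    (7.6, 14.9), (7.5, 14.1), (7.5, 13.9), (8.8, 14.4), (9.2, 15.7),
    (9.0, 15.9), (8.6, 14.4), (9.1, 15.3), (9.0, 16.6), (9.2, 16.8),
]

n = len(pairs)
x_bar = sum(x for x, _ in pairs) / n
y_bar = sum(y for _, y in pairs) / n
sum_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
b = sum_xy / sum_xx

# Sum of squared deviations from the least-squares line.
residual_ss = sum((y - (y_bar + b * (x - x_bar))) ** 2 for x, y in pairs)

s_yx = math.sqrt(residual_ss / (n - 2))  # formula 7.33
s_b = s_yx / math.sqrt(sum_xx)           # formula 7.34

print(f"sum(x^2) = {sum_xx:.3f}")
print(f"s_y.x = {s_yx:.3f}, s_b = {s_b:.3f}")
```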
It can be shown that the ratio,


(7.35)    $t = (b - \beta)/s_b$,


where $\beta$ = the true regression coefficient which is estimated by
b, follows the same t-distribution as that summarized in Table IV with
$n - 2$ degrees of freedom. Therefore, a confidence interval can be
computed for $\beta$, and it can be interpreted in the manner
previously shown. For Table 7.22, the 95 per cent confidence interval
is obtained as follows:


$-2.05 \le (0.6072 - \beta)/0.110 \le +2.05$


will be a true inequality for 95 per cent of all samples with 28
degrees of freedom. Hence the 95 per cent confidence interval is found
to be as follows after some simplification of the preceding inequality:


(7.36)    $0.38 \le \beta \le 0.83$.
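
As a rough check on formulas 7.34 and 7.35, the following Python sketch
recomputes the standard deviation of b and the confidence limits from
the Table 7.22 summaries quoted above. The function name is
hypothetical, and the critical value 2.05 is simply taken from the
text (Table IV) rather than computed.

```python
import math

def slope_interval(b, sum_sq_resid, sum_sq_x, n, t_crit):
    """Return s_b and the confidence limits on the true slope."""
    s_yx = math.sqrt(sum_sq_resid / (n - 2))   # formula 7.33
    s_b = s_yx / math.sqrt(sum_sq_x)           # formula 7.34
    half_width = t_crit * s_b                  # from the t-ratio 7.35
    return s_b, b - half_width, b + half_width

# Summaries quoted for Table 7.22; t_crit = 2.05 is the tabled value used above.
s_b, lower, upper = slope_interval(0.6072, 12.9226, 37.9120, 30, 2.05)
print(round(s_b, 3), round(lower, 2), round(upper, 2))
```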


It would be concluded in practice that the slope of the true linear
regression line is some value between 0.38 and 0.83, but it is
recognized that there are 5 chances in 100 that the sample has led us
to a false statement. A more useful statement might be that it is
estimated from (7.36) that a turkey which is one pound heavier than
another at 16 weeks of age will, on the average, be 0.38 to 0.83 pound
heavier at 28 weeks of age. That is, the lighter turkeys at 16 weeks
tend to catch up some, but they usually remain 0.38 to 0.83 pound
lighter at 28 weeks for each pound that they were lighter at 16 weeks
of age.

Another application of linear trend analysis which makes use of
$s_{y \cdot x}$ is one in which Y is to be estimated for some
unobserved value of X; for instance, for X = 9.5 pounds at 16 weeks.
If X is set equal to 9.5 in formula 7.24,
$\hat{Y} = 0.6072(9.5) + 10.04 = 15.8$ pounds at 28 weeks of age. How
reliable is this estimate? A look at the scatter diagram leaves only
the impression that this estimate should be fairly reliable; hence a
more specific measure of its accuracy is needed. The standard
deviation of $\hat{Y}$ is given by the following formula:


(7.37)    $s_{\hat{Y}} = s_{y \cdot x} \sqrt{1/n + (X - \bar{x})^2 / \sum x^2}$,


where X is the value used to calculate the $\hat{Y}$. This estimate of
the standard deviation of $\hat{Y}$ is based on $n - 2$ degrees of
freedom, as explained earlier. It will be convenient in subsequent
discussions to add “with n - 2 D/F” after an estimate of this sort to
indicate the number of chance deviations upon which the estimate is
based. In the example considered in this paragraph,


$s_{\hat{Y}} = 0.679 \sqrt{1/30 + (9.5 - 7.56)^2/37.912} = 0.679(.364)$
$= 0.247$, with 28 D/F.
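
The same calculation can be sketched in Python. The function name
below is made up for this illustration, and the numerical inputs are
the summaries already quoted for Table 7.22 (with 7.56 taken as the
mean 16-week weight, as in the computation just shown).

```python
import math

def sd_of_estimated_mean(s_yx, n, x, x_bar, sum_sq_x):
    """Formula 7.37: standard deviation of the estimated Y at a chosen X."""
    return s_yx * math.sqrt(1.0 / n + (x - x_bar) ** 2 / sum_sq_x)

y_hat = 0.6072 * 9.5 + 10.04        # trend-line estimate at X = 9.5 pounds
s_y_hat = sd_of_estimated_mean(0.679, 30, 9.5, 7.56, 37.912)
print(round(y_hat, 1), round(s_y_hat, 3))
```

Note that the second term under the radical grows with the squared
distance of X from the mean, so estimates made far from the center of
the observed X's are less reliable than those made near it.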


This standard deviation applies when the X's have been chosen in
advance and are not subject to sampling error. As noted earlier, the
b is then an unbiased sample estimate of the population parameter,
$\beta$. Under these circumstances the formula 7.37 can be partially
rationalized in lieu of a more rigorous demonstration of its validity.
The $\hat{Y}$ for a particular X, say $X_i$, is obtained from
$\hat{Y}_i = \bar{y} + (X_i - \bar{x})b = \bar{y} + x_i b$. Hence the
variance of $\hat{Y}_i$ is obtained from the variance of the sum,
$\bar{y} + x_i b$, in which the $x_i$ is a fixed number. The variance
of $\bar{y}$ for this particular X will be one-nth of the variance
about the trend line, or $s_{y \cdot x}^2/n$. The variance of b is
$s_{y \cdot x}^2/\sum x^2$, as noted earlier, and $x_i$ is a constant;
hence the variance of $x_i b = x_i^2 \cdot s_{y \cdot x}^2/\sum (x^2)$.
If the variance of the sum, $\bar{y} + x_i b$, is just the sum of the
variances of those two terms, it follows that


variance of $\hat{Y}_i = s_{\hat{Y}}^2 = s_{y \cdot x}^2/n + x_i^2 \cdot s_{y \cdot x}^2/\sum (x^2)$

          $= s_{y \cdot x}^2 [1/n + x_i^2/\sum (x^2)]$,


so that $s_{\hat{Y}} = s_{y \cdot x} \sqrt{1/n + x_i^2/\sum (x^2)}$, as
in formula 7.37 for a particular $X = X_i$. It is true that the
variance of $\bar{y} + x_i b$ is the sum of the variances of the two
terms, but this will not be proved here.

It can be shown that the ratio

(7.38)    $t = (\hat{Y} - \mu_{y \cdot x}) / s_{\hat{Y}}$,

where $\mu_{y \cdot x}$ is the true average Y for the given X, follows
the t-distribution with $n - 2$ degrees of freedom. This fact makes it
possible to place a confidence interval on $\mu_{y \cdot x}$ with any
appropriate confidence coefficient.

The meaning of the $\mu_{y \cdot x}$ can be made clearer by reference
to Figure 7.21. For each particular X there is a frequency distribution
of the corresponding Y's. This distribution of Y's has a true
arithmetic mean, which is the $\mu_{y \cdot x}$ for that X.

If we wish to make an interval estimate which applies to an individual
rather than to a group mean, we must take account of the greater
variation exhibited by such individuals as compared to the group. For
example, suppose that a study has been made of the relationship
between the ages of Kansas females and their basal metabolism rates as
expressed in calories per square meter of surface area per hour. It is
supposed that the age interval chosen is such that a linear
relationship exists between these two variables, and that the
least-squares equation for $\hat{Y}$ has been obtained from a sample.
Suppose, furthermore, that the equipment needed to determine the basal
rate is not available in a certain area, and a Kansas woman 25 years
of age wishes an estimate of her basal metabolism rate as a matter of
interest. The best point estimate is the $\hat{Y}$ calculated for
X = 25; but when an interval estimate is needed (and it is more useful
in the present problem) account must be taken of the fact that this
woman is not supposed to be an average person representing all those
who are Kansas females 25 years of age. She is regarded as one
particular person who wishes an estimate of her own basal rate. In
this circumstance the variance of $\hat{Y}$ used earlier in this
section is not correct because it includes only two sources of
variation: one from the mean and one from the sampling regression
coefficient. In the present problem a third source must be included,
namely, individual variation about the mean. When the particular X has
been taken into account, this additional variance is just
$s_{y \cdot x}^2$; hence (again it turns out that this can be added to
the other two components) we obtain the following formula for the
variance of the $\hat{Y}$ for an individual:

(7.39)    $s_{\hat{Y}}^2 = s_{y \cdot x}^2 [1 + 1/n + (X - \bar{x})^2/\sum (x^2)]$.
When this change in $s_{\hat{Y}}$ is made in formula 7.37 we obtain
the formula for $s_{\hat{Y}}$ which is employed in the following
t-ratio:


(7.38a)    $t = (\hat{Y} - \mu_{y \cdot x_i}) / s_{\hat{Y}}$,

where $\mu_{y \cdot x_i}$ = the true Y-value for the ith individual for
whom $X = X_i$. Formula 7.38a and the usual methods make it possible
to obtain a confidence interval on $\mu_{y \cdot x_i}$ with any
specified confidence coefficient. It should be both clear and
reasonable that such a confidence interval will be longer than a
corresponding one from formula 7.38 because the standard deviation is
larger.


Problem 7.31. You are about to buy one turkey which weighs 6.5 pounds at 
16 weeks of age, and you are going to keep it until it is at least 28 weeks of age. 
What is the 94 per cent confidence interval on its 28-week weight, assuming it 
comes from the population sampled in Table 7.22? 


It was seen in the discussion of Table 7.22 that
$\hat{Y} = 0.6072X + 10.04$, which = 13.99 pounds for X = 6.5 pounds.
Also, $s_{y \cdot x} = 0.679$ pound, and $\sum (x^2) = 37.9120$; hence
by formula 7.39, after taking the square root,


$s_{\hat{Y}} = 0.679 \sqrt{1 + 1/30 + (6.5 - 7.56)^2/37.9120} = 0.700$


is the standard deviation of $\hat{Y}$ for X = 6.5 pounds at 16 weeks
of age. Therefore, $t = (13.99 - \mu_{y \cdot x_i})/0.700$, and the 94
per cent confidence interval is derived from


$-2.19(0.700) \le (13.99 - \mu_{y \cdot x_i}) \le 2.19(0.700)$.

It is found that

$CI_{94}$: $12.5 \le \mu_{y \cdot x_i} \le 15.5$,

to the nearest one-half pound.

Notice that the situation just considered clearly is one in which
the confidence interval for a particular turkey is required. You do
not have a group of turkeys so that the high 28-week weights of some
of them can be expected to offset the low 28-week weights of others.
Therefore, you must face the fact that this particular turkey's weight
at 28 weeks of age may be quite low, as well as quite high, for turkeys
weighing 6.5 pounds at 16 weeks of age.


Problem 7.32. Suppose that problem 7.31 is changed to state that you have
bought a rather large flock of 6.5-pound turkeys each 16 weeks of age. Compute
the $CI_{94}$ appropriate to this new situation.

The only change in the computations is that the standard deviation
of $\hat{Y}$ now is


$s_{\hat{Y}} = s_{y \cdot x} \sqrt{1/n + x^2/\sum (x^2)} = 0.679(0.251) = 0.170$ pound


instead of the 0.700 pound obtained for the individual. It follows
that the required confidence interval is:


$CI_{94}$: 13.5 pounds $\le \mu_{y \cdot x} \le$ 14.5 pounds,


to the nearest one-half pound. This is a narrower interval than is
obtained for problem 7.31, as should be expected.
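
The contrast between problems 7.31 and 7.32 can be laid side by side
in a short Python sketch. Everything named below is hypothetical
scaffolding for the illustration; the numbers are the Table 7.22
summaries and the tabled value t = 2.19 used in the text. The only
difference between the two cases is the extra 1 under the radical for
an individual (formula 7.39 versus formula 7.37).

```python
import math

B, A = 0.6072, 10.04             # slope and intercept quoted for Table 7.22
S_YX, N = 0.679, 30              # standard deviation about the trend line, sample size
X_BAR, SUM_SQ_X = 7.56, 37.9120  # mean of X and sum of squared x-deviations

def y_interval(x, t_crit, individual):
    """Point estimate, its standard deviation, and confidence limits at X = x."""
    y_hat = B * x + A
    extra = 1.0 if individual else 0.0   # individual variation about the mean
    sd = S_YX * math.sqrt(extra + 1.0 / N + (x - X_BAR) ** 2 / SUM_SQ_X)
    return y_hat, sd, y_hat - t_crit * sd, y_hat + t_crit * sd

def nearest_half(value):
    """Round to the nearest one-half pound, as in the text."""
    return round(2 * value) / 2

one_bird = y_interval(6.5, 2.19, individual=True)    # problem 7.31
flock = y_interval(6.5, 2.19, individual=False)      # problem 7.32
for label, (y_hat, sd, low, high) in (("individual", one_bird), ("group", flock)):
    print(label, round(sd, 3), nearest_half(low), nearest_half(high))
```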


PROBLEMS 


1. Compute $s_y$ and $s_{y \cdot x}$ for the data for Figures 7.11A and G and relate their
comparative sizes to the scatter diagrams.

2. Work problem 1 for Figures 7.11B and G. Does the downward trend of the
points on a scatter diagram, as contrasted with an identical upward trend, have
anything to do with the comparison between $s_y$ and $s_{y \cdot x}$?

Ans. B: $s_y = 16.0$, $s_{y \cdot x} = 3.1$; G: $s_y = 9.2$, $s_{y \cdot x} = 9.8$.

3. Referring to Figure 7.11E and the associated data, compute and compare
$s_y$ and $s_{y \cdot x}$ as in problem 1. Could you have predicted from the scatter diagram
that they would be of essentially the same magnitude? Why?

4. By visual inspection of Figures 7.11C, D, and F what do you conclude about
the comparative sizes of $s_y$ and $s_{y \cdot x}$ for each figure?

5. For the two sets of data in problem 7, section 7.1, compute the percentage
reduction in the standard deviation of the $Y_i$ achieved if variability is measured
about the linear trend line rather than about the line $\hat{Y} = \bar{y}$ for each set. Discuss
the two results obtained for the two sets in terms of the curvilinear trend in one set.

6. Use the data of problem 1, section 7.1, to estimate the average Y for X = 13.
Also compute the standard deviation of this estimate, first considering a group
with X = 13 and then for an individual with X = 13.
Ans. $\hat{Y} = 345.7$ when X = 13; $s_{\hat{Y}} = 0.53$; 1.01.

7. Use the data of set B, problem 7, section 7.1, to place 92 per cent
confidence limits on the log(weight) of the 7-day-old bee larvae of the kind
represented by this sample. Interpret these limits.

8. Compute the 99 per cent confidence interval on $\beta$ for problem 1, section
7.1, and draw appropriate conclusions. Ans. $CI_{99}$: $21.2 \le \beta \le 23.0$.

9. The following data express the farm population (as defined for the 1950
Census) as a percentage of the total U. S. population:


Year:      1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951
Per Cent:  21.8 21.5 20.6 18.8 17.7 17.3 17.9 17.0 16.9 16.7 16.0 15.0


These are not sampling data, but the fitting of a trend line to these data may
be useful anyway. For example, if the war years, 1943 to 1945, inclusive, are
ignored, the downward trend in percentage farm population is quite closely
represented by a straight line. Make a scatter diagram of the above data,
omit 1943, 1944, and 1945 from further consideration, fit a linear trend line by
the method of least squares, and then read from the line the approximate
percentages for the years omitted. Would the discussions of estimates of the
standard deviation and the formulas given in this book for them be appropriate
here? Give reasons for your answer.

10. Referring to problem 9, could you use the equation obtained there to
predict satisfactorily the percentage farm population for 1953? For 1960?
Justify your answers.


7.4 COEFFICIENTS OF LINEAR CORRELATION 


It is not always desirable, or even appropriate, to obtain an equation
for the linear relation between the two types of measurements being
studied, as was done earlier in this chapter. It may be better to
describe the relationship as linear, and to give a standard, unitless,
measure of its strength, or closeness. This is the purpose of a
coefficient of linear correlation.

Although correlation coefficients are widely used, and often without
attention to the satisfaction of necessary assumptions, it should be
kept in mind that, strictly speaking, both X and Y must be random
variables which follow normal frequency distributions. This will be
assumed to be true in the following discussion of this section.


It has been seen that the variance of the observed Y's about the
least-squares regression line depends on the size of
$\sum (Y - \hat{Y})^2$, in which $\hat{Y} = \bar{y} + bx$. Hence the
magnitude of this variance depends on
$\sum (Y - \bar{y} - bx)^2 = \sum (y - bx)^2 = \sum (y^2 - 2bxy + b^2x^2)$.
But the last summation can be computed in three parts as follows:


$\sum (y^2 - 2bxy + b^2x^2) = \sum (y^2) - 2b \sum (xy) + b^2 \sum (x^2)$

    $= \sum (y^2) - 2 [\sum (xy)/\sum (x^2)] \cdot \sum (xy) + [\sum (xy)/\sum (x^2)]^2 \cdot \sum (x^2)$

    $= \sum (y^2) - 2 [\sum (xy)]^2/\sum (x^2) + [\sum (xy)]^2/\sum (x^2)$;


hence,


(7.41)    $\sum (Y - \hat{Y})^2 = \sum (y^2) - [\sum (xy)]^2/\sum (x^2)$.


Therefore, it is clear that the observed variability of the Y's about
the regression line will be large or small according to the size of
$[\sum (xy)]^2/\sum (x^2)$ compared with the size of $\sum (y^2)$. If
$[\sum (xy)]^2/\sum (x^2)$ is multiplied by $\sum (y^2)/\sum (y^2)$,
which equals 1 and only changes the form of the quantity by which it
is multiplied, it follows from (7.41) that the $\sum (Y - \hat{Y})^2$
can be expressed as follows:


$\sum (Y - \hat{Y})^2 = \sum (y^2) \cdot \{1 - [\sum (xy)]^2 / [\sum (x^2) \cdot \sum (y^2)]\}$.

Clearly, the quantity $[\sum (xy)]^2 / [\sum (x^2) \cdot \sum (y^2)]$
has the following statistical features:

(a) Its value cannot be less than zero nor more than one because it 
is essentially zero or positive, and if it exceeded one the sum of 
squares of deviations from the trend line would be negative, which is 
absurd. 

(b) If this quantity is near zero there is about as much scatter of
the sample points about the trend line as about the horizontal line
$\hat{Y} = \bar{y}$; hence there is little or no linear trend.

(c) If this quantity is near one there is very little scatter about
the regression line; hence the sample points lie quite close to that line.

(d) As the size of this quantity varies (for different samples)
from zero to one the scatter of the sample points about the
least-squares regression line varies from a completely trendless,
shot-gun, pattern to a perfect fit to a linear trend line.

(e) This quantity is unitless so that the features noted above are
true regardless of the units in which Y and X are measured.

(f) In its present form this quantity cannot distinguish between 
positive and negative slopes of trend lines, but its square root would 
have the same sign as b and would make this distinction if the square 
root of the denominator were always taken as positive. 
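
Features (a) through (f) all rest on the identity that follows from
(7.41). The Python sketch below checks that identity numerically on
five made-up pairs of measurements; the data and the function name are
hypothetical and serve only to illustrate the algebra.

```python
def trend_summary(xs, ys):
    """Return the residual sum of squares, the sum of y-squares, and the ratio in (a)-(f)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sum_xy / sum_xx                               # least-squares slope
    resid = sum((y - (y_bar + b * (x - x_bar))) ** 2  # scatter about the trend line
                for x, y in zip(xs, ys))
    ratio = sum_xy ** 2 / (sum_xx * sum_yy)           # the quantity discussed above
    return resid, sum_yy, ratio

# Hypothetical data, not taken from any table in this book.
resid, sum_yy, ratio = trend_summary([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
print(round(resid, 4), round(sum_yy * (1 - ratio), 4), round(ratio, 4))
```

Multiplying every X or every Y by a constant (a change of units)
leaves the ratio unchanged, which is feature (e).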


It follows then that the square root noted in f,


(7.42)    $r = \sum (xy) / \sqrt{\sum (x^2) \cdot \sum (y^2)}$,

is a unitless number within the range $-1 \le r \le +1$ which indicates
the direction and strength of the observed linear trend. This number,
r, is called the product-moment coefficient of linear correlation
between two measurements X and Y. It obviously is subject to sampling
variations and therefore has a sampling distribution. It is a
sampling estimate of a corresponding population parameter indicated
by the Greek letter $\rho$ (rho), which succinctly describes the degree
of scatter of the population points about the true linear trend line
as, for example, in Figure 7.21. If $\rho = 1$, all the points will lie
on the regression line. Since they do not (in Figure 7.21) there is
sampling error in the estimation of $\rho$, and hence the r varies from
sample to sample. This is similar to the situation when the true
regression coefficient, $\beta$, was being estimated from samples.

If the $\rho$ is zero, all the sampling estimates $r_i$ will not be
zero but will have a sampling distribution which is nearly normal in
form. In such circumstances it can be shown that the following ratio
follows the t-distribution with $n - 2$ D/F. Thus


(7.43)    $t = r / \sqrt{(1 - r^2)/(n - 2)} = r \sqrt{n - 2} / \sqrt{1 - r^2}$


can be used in the usual manner to test the hypothesis $H_0(\rho = 0)$.
As was seen in earlier discussions, $H_0$ will be rejected whenever the
size of t becomes so great that it is unreasonable (according to some
predetermined standard) to believe that this t is the product of
sampling variation.

It is more difficult to place a confidence interval on $\rho$ than on
$\beta$ because r is not nearly normally distributed when
$\rho \ne 0$. This process of computing a confidence interval on
$\rho$ will be discussed and developed somewhat heuristically by means
of the empirical data found in Table 7.41. These data were obtained by
drawing random samples of pairs of values of X and Y from Table 7.21
for which $\rho = .749$. This population was considered to be
approximately a normal bivariate population. The 190 sampling r's thus
obtained are summarized in Table 7.41 along with the corresponding z's
(see discussion of z below Figure 7.41). The distribution of r for a
$\rho$ so large as this is definitely skewed, as can be seen to some
extent in Table 7.41 and in Figure 7.41.


                            TABLE 7.41

  OBSERVED SAMPLING DISTRIBUTIONS OF r AND $z = (1/2) \log_e [(1 + r)/(1 - r)]$
                  FOR n = 12 AND $\rho = +.749$


      r-interval      $f_r$        z-interval      $f_z$

      .890-1.000        7          1.70-1.89         1
      .790- .889       32          1.50-1.69         3
      .690- .789       48          1.30-1.49         9
      .590- .689       52          1.10-1.29        22
      .490- .589       23          0.90-1.09        43
      .390- .489       12          0.70-0.89        56
      .290- .389        8          0.50-0.69        29
      .190- .289        5          0.30-0.49        19
      .090- .189        2          0.10-0.29         7
     -.010- .089        1         -0.10-0.09         1

        Total         190            Total         190


[Figure 7.41: frequency ($f_r$ or $f_z$) plotted against r or z.]

Figure 7.41. Sampling frequency distribution of the correlation coefficient, r,
and of the corresponding $z = (1/2) \log_e [(1 + r)/(1 - r)]$. n = 12.


It was found by R. A. Fisher that under these circumstances it is
helpful to use the following function of r:


(7.44)    $z = (1/2) \log_e [(1 + r)/(1 - r)] = (2.30259/2) \log_{10} [(1 + r)/(1 - r)]$

            $= 1.1513 \log_{10} [(1 + r)/(1 - r)]$


because its sampling distribution is essentially normal in all
important features even when $\rho$ is definitely $\ne 0$. Moreover,
its variance is given by $\sigma_z^2 = 1/(n - 3)$. This is not a
sampling estimate but the true variance of z. It follows that, as a
good approximation, the quantity $y = (z - z_\rho)/\sigma_z$, where
$z_\rho$ is the z corresponding to $\rho$ in (7.44), is normally
distributed. Hence, Table III gives the probabilities needed in tests
of hypotheses regarding $\rho$ or in the calculation of confidence
intervals on $\rho$. For example, consider the first sample drawn for
Table 7.41. Then n = 12 and r = .668; therefore, by formula 7.44,


$z = 1.1513 \log_{10} (1.668/0.332) = 1.1513 \log_{10} 5.024 = 0.807$, and

$\sigma_z = 1/\sqrt{9} = 0.333$.


Then, since $y = (0.807 - z_\rho)/0.333$ is a member of a standard
normal population, the probability distribution of Table III can be
used. If a confidence coefficient .95 is chosen, the inequality


$-1.96 \le (0.807 - z_\rho)/0.333 \le +1.96$

requires that


$0.154 \le z_\rho \le 1.460$

unless a 1 in 20 chance has occurred in this sample. The corresponding
95 per cent confidence interval on $\rho$ is obtained by using formula
7.44 and solving for the $\rho$, which now replaces r. Thus,


$z_{\rho_1}$ = lower limit $= 0.154 = (1/2) \log_e [(1 + \rho_1)/(1 - \rho_1)]$,

so that $(1 + \rho_1)/(1 - \rho_1) = e^{0.308}$. But,
$\log_{10} e^{0.308} = 0.308 \log_{10} e = 0.308(0.4343) = 0.134$; and

Anti-log $0.134 = 1.36 = (1 + \rho_1)/(1 - \rho_1)$.


Hence $2.36 \rho_1 = 0.36$ so that $\rho_1 = .155$. Similarly
$\rho_2$ = upper limit of the 95 per cent confidence interval = .898;
therefore, the 95 per cent confidence interval on $\rho$ is


$.155 \le \rho \le .898$,


which is a very wide interval but does include the true $\rho$, known
in this case to be .749. If a relatively narrow confidence interval is
needed, it is apparent that a rather large sample must be taken.
Figure 7.41 shows that the sample correlation coefficient varies over
a considerable range even when $\rho$ is as large as .749. As a matter
of fact one sample r out of 190 was negative in spite of the relatively
high positive correlation. This figure shows also, to a useful degree,
the normalizing effect of the transformation
$z = (1/2) \log_e [(1 + r)/(1 - r)]$. Given a large number of sample
correlations, the z-curve would become approximately normal in shape,
as can be imagined from Figure 7.41.
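
The whole procedure (transform r to z, set limits on z, transform the
limits back) can be sketched in a few lines of Python. The function
names are invented for this illustration, and the normal value 1.96 is
the one used above. The sketch relies on the fact that formula 7.44 is
the inverse hyperbolic tangent of r, so the hyperbolic tangent carries
each z-limit back to a limit on the correlation.

```python
import math

def fisher_z(r):
    """Formula 7.44: z = (1/2) log_e[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))

def correlation_interval(r, n, normal_crit=1.96):
    """Approximate confidence limits on the true correlation via Fisher's z."""
    z = fisher_z(r)
    sigma_z = 1 / math.sqrt(n - 3)       # true standard deviation of z
    z_low = z - normal_crit * sigma_z
    z_high = z + normal_crit * sigma_z
    return math.tanh(z_low), math.tanh(z_high)   # tanh inverts formula 7.44

z = fisher_z(0.668)                      # first sample drawn for Table 7.41
low, high = correlation_interval(0.668, 12)
print(round(z, 3), round(low, 2), round(high, 2))
```

Carrying full precision through the back-transformation can shift the
third decimal of a limit slightly from a value obtained with rounded
logarithm tables.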


Problem 7.41. Finney and Barmore (Cereal Chemistry, Vol. 25 [1948], page
299) have reported that the linear correlation between the per cent of protein
in Nebred wheat flour and the loaf volume of bread baked therefrom was
r = .94 on a sample of 30 pairs of measurements. What useful information does
this provide?


The mere fact that $r^2 = (.94)^2 = .8836$ tells us that 88.36 per cent
of the original sum of squares of the loaf volumes (Y) about their
mean, $\bar{y}$, can be associated with the linear increase of that
measurement with increasing protein concentration in the flour (X).
Loaf volume is an important factor when the quality of bread is
judged, and it is important to know what affects it.

It is inconceivable that such a large correlation coefficient would be
obtained accidentally on thirty random observations; but, to illustrate
the method, the hypothesis $H_0(\rho = 0)$ will be tested. It is seen
that


$t = .94 / \sqrt{(1 - .8836)/28} = .94/0.065 = 14.5$, 28 D/F.


Such a large t would occur by chance almost never; hence the
hypothesis $H_0(\rho = 0)$ is decisively rejected. We know without even
seeing the scatter diagram that the sample points lie closely about
a linear regression line which has an upward trend. It also is
apparent that the loaf volume from Nebred flour meeting the conditions
of this experiment could be predicted quite accurately from a
knowledge of its protein concentration.
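
The test of problem 7.41 amounts to one line of arithmetic, sketched
here in Python with a hypothetical function name. The only inputs are
the r and n reported in the problem.

```python
import math

def t_for_zero_correlation(r, n):
    """Formula 7.43: t-ratio for testing that the true correlation is zero."""
    return r / math.sqrt((1 - r ** 2) / (n - 2)), n - 2

t, df = t_for_zero_correlation(0.94, 30)
share_explained = 100 * 0.94 ** 2        # per cent of the sum of squares
print(round(t, 1), df, round(share_explained, 2))
```

Without rounding the denominator to 0.065 the ratio comes out a little
above the 14.5 shown, which changes nothing in the conclusion.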

There are some circumstances under which it is desirable to determine
if two random samples probably were drawn from the same bivariate
population as regards one, or both, of $\beta$ and $\rho$. For example,
it might be of interest to learn if one method of raising turkeys
produces a more consistent relationship between the 16-week and the
28-week weights so that we could cull at 10 weeks of age with more
confidence. Such an improvement in the relationship between these
variables would indicate that the true coefficient of linear
correlation, $\rho$, had been increased by the new methods. It also
might be that superior poultry husbandry could increase the amount by
which a weight advantage at 16 weeks of age would be followed by a
weight advantage at 28 weeks of age. In the population considered
earlier in this chapter, a one-pound advantage in weight at 16 weeks
of age was reduced, on the average, to only a one-half-pound advantage
at 28 weeks of age. Thus $\beta$ was 1/2. It might be that the size of
$\beta$ could be increased by superior breeding and handling.

If two samples (from two methods of breeding and raising turkeys,
for example) have resulted in the computation of $b_1$, $b_2$, $r_1$,
and $r_2$, the testing of the two hypotheses $H_0(\beta_1 = \beta_2)$
and $H_0(\rho_1 = \rho_2)$ can be carried out as follows:


For $H_0(\beta_1 = \beta_2)$:


(a) Pool the $\sum (Y_1 - \hat{Y}_1)^2$ and the
$\sum (Y_2 - \hat{Y}_2)^2$ from the two samples, pool the degrees of
freedom, and calculate

pooled $s_{y \cdot x} = \sqrt{[\sum (Y_1 - \hat{Y}_1)^2 + \sum (Y_2 - \hat{Y}_2)^2] / (n_1 + n_2 - 4)}$.

(b) Compute $\sqrt{1/\sum (x_1^2) + 1/\sum (x_2^2)}$.

(c) Multiply the standard deviation from a by the result obtained
in b. This is the estimated standard deviation of $(b_1 - b_2)$, which
will be called $s_{b_1 - b_2}$.

(d) Compute $t = (b_1 - b_2)/s_{b_1 - b_2}$, assign it
$n_1 + n_2 - 4$ degrees of freedom, and interpret as before with
respect to the acceptance or the rejection of
$H_0(\beta_1 = \beta_2)$.


For $H_0(\rho_1 = \rho_2)$:


(a) Transform the $r_1$ and the $r_2$ to $z_1$ and $z_2$,
respectively, in the manner described earlier in this chapter.

(b) Compute $\sigma_{z_1 - z_2} = \sqrt{1/(n_1 - 3) + 1/(n_2 - 3)}$.

(c) Calculate $y = |z_1 - z_2| / \sigma_{z_1 - z_2}$ and consider
this ratio as a normally distributed quantity in deciding whether or
not it is so large that the hypothesis $H_0(\rho_1 = \rho_2)$ should be
rejected.


If it seems appropriate after a hypothesis has been rejected,
confidence intervals can be determined for the difference
$\beta_1 - \beta_2$, but not for $\rho_1 - \rho_2$.
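
The two step-by-step procedures above can be sketched as two small
Python functions. The function names are hypothetical. In the usage
lines, the first sample reuses the Table 7.22 summaries; the second
sample and the pair of correlations are invented purely for
illustration and do not come from this book's tables.

```python
import math

def compare_slopes(b1, sse1, sxx1, n1, b2, sse2, sxx2, n2):
    """t-ratio and degrees of freedom for testing equal true slopes."""
    df = n1 + n2 - 4
    pooled_s = math.sqrt((sse1 + sse2) / df)       # step (a)
    factor = math.sqrt(1 / sxx1 + 1 / sxx2)        # step (b)
    s_diff = pooled_s * factor                     # step (c)
    return (b1 - b2) / s_diff, df                  # step (d)

def compare_correlations(r1, n1, r2, n2):
    """Normal deviate for testing equal true correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # step (a): formula 7.44
    sigma = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3)) # step (b)
    return abs(z1 - z2) / sigma                    # step (c)

# Second sample below is hypothetical.
t, df = compare_slopes(0.6072, 12.9226, 37.9120, 30,
                       0.50, 10.0, 40.0, 30)
# Hypothetical pair of sample correlations from samples of 15 each.
y = compare_correlations(0.75, 15, 0.65, 15)
print(round(t, 2), df, round(y, 2))
```

The t-ratio is then referred to the t-table with the pooled degrees of
freedom, and the normal deviate to the normal table, as in the text.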

It is useful at times to have a convenient tabular procedure for
computing b and r when the data are sufficiently numerous to justify
the use of frequency distribution tables. Such data rarely would
come from sampling studies, but perhaps they occur in practice often
enough to justify the inclusion here of a method for obtaining the
r and the b.

As in Chapter 2, the computations will be carried out in units of
the class intervals. A two-way frequency distribution table is needed
because two variables are involved. These matters, and others, are
illustrated and discussed by means of 280 pairs of observations of
16- and 28-week weights of female turkeys similar to those studied
earlier in this chapter. The symbol, X, is used to denote the 16-week
weights and Y will stand for the 28-week weights, as before. Now
that two variables are being considered simultaneously, frequencies
in the X-classes will be symbolized by $f_X$, those for the Y-classes
by $f_Y$. When it is desirable to indicate both the X and the Y for a
class of data, $f_{XY}$ will denote the frequency in that “cell” in the
two-way table. Also, there may be two different lengths of class
interval, $I_X$ and $I_Y$ for X and Y, respectively. With these
symbols in mind, the following formulas are seen to be analogous to
those used previously for b and r:


$b = \dfrac{\sum (f_{XY} \cdot d_X \cdot d_Y) - [(\sum f_X \cdot d_X)(\sum f_Y \cdot d_Y)]/\sum f}{\sum (f_X \cdot d_X^2) - (\sum f_X \cdot d_X)^2/\sum f_X}$, and


$r = \dfrac{\text{same numerator as that above for } b}{\sqrt{(\text{same as denominator above})(\text{same with } Y \text{ replacing } X)}}$.


The data of Table 7.42 are arranged in a two-way frequency
distribution table to provide a relatively easy basis for calculating
b and r from their formulas as given above.

The following computations are derived from the summaries in
Table 7.42:


$\sum (f_Y \cdot d_Y^2) - (\sum f_Y \cdot d_Y)^2/\sum f_Y = 800.5714$, and its square root = 28.65.
$\sum (f_X \cdot d_X^2) - (\sum f_X \cdot d_X)^2/\sum f_X = 924.5679$, and its square root = 30.41.
$\sum (f_{XY} \cdot d_X \cdot d_Y) - [(\sum f_X \cdot d_X)(\sum f_Y \cdot d_Y)]/\sum f = 448.2857$.
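
The grouped-data formulas can be tried out on a very small two-way
table. The Python sketch below uses a made-up table of 23 observations
(it is not Table 7.42), stored as a dictionary from each cell's coded
deviations to its frequency; the function name and the storage scheme
are assumptions of this illustration. Both b and r come out in
class-interval units, as in the computations above.

```python
import math

def grouped_b_and_r(cells):
    """cells maps (d_x, d_y), in class-interval units, to the cell frequency."""
    total = sum(cells.values())
    sum_x = sum(f * dx for (dx, dy), f in cells.items())
    sum_y = sum(f * dy for (dx, dy), f in cells.items())
    sum_xx = sum(f * dx * dx for (dx, dy), f in cells.items())
    sum_yy = sum(f * dy * dy for (dx, dy), f in cells.items())
    sum_xy = sum(f * dx * dy for (dx, dy), f in cells.items())
    numerator = sum_xy - sum_x * sum_y / total
    denom_x = sum_xx - sum_x ** 2 / total
    denom_y = sum_yy - sum_y ** 2 / total
    b = numerator / denom_x                       # slope in class-interval units
    r = numerator / math.sqrt(denom_x * denom_y)  # unitless
    return b, r

# Hypothetical two-way frequency table with three X-classes and three Y-classes.
cells = {(-1, -1): 4, (-1, 0): 2, (0, -1): 1, (0, 0): 6,
         (0, 1): 2, (1, 0): 3, (1, 1): 5}
b, r = grouped_b_and_r(cells)
print(round(b, 3), round(r, 3))
```

The marginal sums in the formulas (those involving $f_X$ or $f_Y$
alone) equal the corresponding sums over the cells, which is why the
sketch needs only the cell frequencies.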


PROBLEMS 


1. Calculate the r for the data of problem 1, section 7.1. 

2. Calculate as in problem 1 for the data of problem 2, section 7.1. Given
DX? = 18484, XY E rs we = 28.6918. Ming. ra 2:06 

3. Compute $\sum (Y - \hat{Y})^2$ for the data of problem 3, section 7.1 by using the
formula $\sum (Y - \hat{Y})^2 = (1 - r^2) \cdot \sum (y^2)$.




4. In the formula of the previous problem, take $\sum (y^2) = 100$ and plot the
left member of this equation on the vertical scale against r on the horizontal
scale. Take r from -1 to +1 by increments of 0.2.

5. Reynolds, Bond, and Kirkland (USDA Tech. Bull. 861) give the following 
information on the relation between the cost of hauling logs and the length of 
the haul in miles over high-grade dirt or gravel roads: 


Miles hauled: 1 2 3 4 5 6 7 8 9 
Cost/1000 cu ft ($): 0.35 0.44 0.53 0.62 0.71 0.81 0.90 0.99 1.08 


Miles hauled: 10 11 12 13 14 15 16 17 18 
Cost/1000 cu ft ($): 1.17 1.26 1.36 1.45 1.54 1.63 1.72 1.82 1.90 


Miles hauled: 19 20
Cost/1000 cu ft ($): 2.00 2.09


Compute a coefficient of linear correlation between length of haul and cost per
1000 cubic feet of volume, and draw conclusions. Is this really a proper use of
correlation analysis? Would a regression analysis be better?

6. The persons mentioned in problem 5 gave the following data on the cost 
of producing 1000 cubic feet of hardwood logs in relation to the breast-high 
diameter of the logs: 


Diam. (in.): 10 11 12 13 14 15 16 17 
Cost ($): 12.70 12.63 12.38 12.03 11.62 11.32 11.10 10.84 


Diam. (in.): 18 19 20 21 22 23 24 25 
Cost ($): 10.63 10.49 10.40 10.28 10.13 10.04 9.96 9.88 


Make a scatter diagram of these data, compute r, and discuss it in terms of the
scatter diagram. Given: $\sum x^2 = 340$; $\sum XY = 3019.59$; $\sum (y^2) = 14.3129$.
Ans. r = -.97.

7. Compute $s_{y \cdot x}$ and $s_y$ for the data of problem 5, with Y = cost per 1000 cubic
feet.

8. Calculate as in problem 7 for the data of problem 6.
Ans. $s_{y \cdot x} = 0.23$; $s_y = 0.98$.
9. The Yearbook of Labour Statistics for 1943-1944 gives the following average 
daily wages of Chilean copper workers, in pesos: 


Year: 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 
Wage: 11.89 11.26 11.75 11.33 12.80 13.31 14.77 16.37 21.31 23.20 25.34 
$\sum (xy) = 155.74$; $\sum Y = 173.33$; $\sum (Y^2) = 2996.1887$.

Construct a scatter diagram of these data, calculate $s_y$ and $s_{y \cdot x}$, and discuss their
sizes relative to the graph.

10. Compute r for the data of problem 9 (ignoring the fact that the year is
not a random variable) and relate the size of the r to the appearance of the
scatter diagram. Let X = 1 for 1929, 2 for 1930, etc. Ans. r = +.91.

11. Estimate the average wage for the year 1940 from the data of problem 9,
using r in the computation of the standard deviation of this estimate.
12. Solve problem 1, and then test $H_0(\rho = 0)$ and draw appropriate
conclusions. Ans. r = .999, y 10, reject $H_0$ decisively.

13. For the data‘of problem 6 compute the CI, on f, and then interpret this 
interval in a practical way. 

14. For the data of problem 6 compute the $CI_{99}$ on $\rho$ and interpret this
interval. Ans. $-.96 \le \rho \le -.80$.

15. Suppose that two random samples of 15 observations each have resulted
in the computation of $r_1 = .75$ and $r_2 = .65$. Test $H_0(\rho_1 = \rho_2)$ and draw
appropriate conclusions. Also compute the CI,; for each parameter, $\rho_1$ and $\rho_2$,
and interpret these intervals. Can these interpretations be related to the test
of $H_0$?

16. Draw a random sample of 30 observations from Table 7.21, compute the
CIg, on $\rho$, and discuss the meaning of this interval.

17. Draw a random sample of 30 from Table 7.21 and test the hypothesis:
$H_0(\rho = 0)$. How frequently would this procedure result in the rejection of $H_0$
when $\rho \ne 0$ (as in this case) at the 5 per cent level of rejection?

18. Draw two random samples of size 30 from Table 7.21 and test $H_0(\rho_1 = \rho_2)$,
assuming that the first sample is from a bivariate normal population with $\rho = \rho_1$,
and similarly for the second sample and $\rho = \rho_2$.

7.5 RANK CORRELATION 


Sometimes it is either necessary or convenient to correlate the
ranks of X's with those of their corresponding Y's. It may be that
the X's and the Y's are only ranks in the first place, or it may be
merely convenient to use ranks instead of four- or five-digit decimals,
for example.

The practice of correlating ranks is both older and broader in its
applications than is sometimes realized. Karl Pearson apparently
was of the opinion that the idea of correlating ranks originated with
Francis Galton during his studies of inheritance. Sometimes C.
Spearman is credited with doing much to develop rank-correlation
methods, especially as applied in psychological studies. It is his
coefficient, $r_s$, which will be discussed specifically below. The
works of M. G. Kendall, and others, recently have increased the use of
ranks in statistics to a considerable degree, but no attempt will be
made herein to give an exhaustive treatment of this subject. The
interested reader is referred to Kendall's book, Rank Correlation
Methods, published by Charles Griffin and Company, London.

The calculation of the Spearman, or rank-difference, coefficient of
linear correlation ($r_s$) will be illustrated by means of the
following pairs of ranks of students in two mathematics courses. Each
pair gives the respective ranks of that student in statistics (X) and
in mathematics of finance (Y). For example, the first student listed
ranked second in his class in statistics on the final examination, but
ranked fifth in the final examination in mathematics of finance.


Student:              1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
X (statistics):       2  7  4  1 10  8 15  9 16  5  6 12 11 13  3 14
Y (finance):          5  4  3  2  9  7 16  6 15 12  8 11 10 13  1 14
Change in rank (d):   3  3  1  1  1  1  1  3  1  7  2  1  1  0  2  0

It is seen from these data that there is a general but imperfect 
tendency for a student's grades to rank about the same in both subjects, 
that is, a student's grade in statistics has some relation to his 
grade in mathematics of finance. If the relationship is basically linear, 
it can be measured rather simply and satisfactorily by means of 
the following formula for what is called the Spearman, or rank-difference, 
coefficient of correlation: 

\[ r_s = 1 - \frac{6\,\Sigma(d^2)}{n(n^2 - 1)} \tag{7.51} \] 

where d is the difference between the two ranks in a pair (in the 
above illustration) or, in general, between the ranks of Xᵢ and Yᵢ, i 
varying from 1 to n. For the data on ranks in statistics and in 
mathematics of finance, 

d₁ = 2 − 5 = −3, d₂ = 7 − 4 = +3, ..., d₁₆ = 14 − 14 = 0; 

hence, Σ(d²) = 92, and r_s = 1 − 552/[16(255)] = 0.865. 
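
The arithmetic of formula 7.51 can be checked with a short Python sketch. The ranks are those of the 16 students in the illustration above. The function name `spearman_from_ranks` is an illustrative choice, not something taken from the text, and the sketch assumes untied ranks running from 1 to n.

```python
# Ranks of the 16 students: statistics (X) and mathematics of finance (Y).
x_ranks = [2, 7, 4, 1, 10, 8, 15, 9, 16, 5, 6, 12, 11, 13, 3, 14]
y_ranks = [5, 4, 3, 2, 9, 7, 16, 6, 15, 12, 8, 11, 10, 13, 1, 14]


def spearman_from_ranks(x, y):
    """Rank-difference coefficient by formula 7.51 (hypothetical helper).

    Assumes x and y already hold ranks, paired student by student.
    """
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of ranks")
    n = len(x)
    sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Sum of squared rank differences, then the coefficient itself.
sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x_ranks, y_ranks))
r_s = spearman_from_ranks(x_ranks, y_ranks)
print(sum_d2, round(r_s, 3))  # 92 0.865
```

The two printed values agree with Σ(d²) = 92 and r_s = 0.865 as computed by hand.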

If there are ties for ranks, each X (or Y) so tied is given the mean 
of the ranks involved in the tie. For example, if two X's are tied for 
ranks 1 and 2, each X is given a rank of 1.5; if three Y's are tied 
among themselves for ranks 1, 2, and 3, each is considered to have 
rank 2. 
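
The rule for ties can be sketched in Python as follows. The helper `mean_ranks` and the scores passed to it are made up for illustration, and the sketch assumes that rank 1 goes to the smallest value; reverse the sort if rank 1 should go to the largest.

```python
def mean_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks.

    Hypothetical helper: rank 1 is given to the smallest value.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start+1 .. end+1 are tied; each gets their mean.
        shared = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


two_tied = mean_ranks([70, 70, 85, 90])    # two values tied for ranks 1 and 2
three_tied = mean_ranks([55, 55, 55, 80])  # three values tied for ranks 1, 2, 3
print(two_tied)    # [1.5, 1.5, 3.0, 4.0]
print(three_tied)  # [2.0, 2.0, 2.0, 4.0]
```

The mean ranks obtained this way are then used as the ranks of X (or Y) when the differences d are formed.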

It can be shown that r_s never has a size outside the range −1 to 
+1, regardless of the types of measurements involved or their sizes. 
It is seen from formula 7.51 that, if each Y has exactly the same rank 
as its corresponding X, all of the d's are 0 and hence Σ(d²) = 0 and 
r_s = 1. If the ranks are perfectly reversed (1 with 16, 2 with 15, 
etc.), r_s = −1. 
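
Both extremes can be verified numerically for n = 16. This is a minimal sketch; the function name is an assumption made for illustration, and the formula inside it is formula 7.51.

```python
def spearman_from_ranks(x, y):
    """Formula 7.51 applied to two lists of paired, untied ranks."""
    n = len(x)
    sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


n = 16
ranks = list(range(1, n + 1))   # 1, 2, ..., 16
reversed_ranks = ranks[::-1]    # 16, 15, ..., 1

same = spearman_from_ranks(ranks, ranks)               # identical ranking
opposite = spearman_from_ranks(ranks, reversed_ranks)  # perfectly reversed
print(same, opposite)  # 1.0 -1.0
```

For the reversed ranking Σ(d²) = n(n² − 1)/3 = 1360, so the fraction in formula 7.51 equals 2 and r_s = 1 − 2 = −1.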

Kendall discusses such matters as confidence intervals for rank-correlation 
coefficients in his book (reference above) as well as introducing 
the coefficient tau (τ), which he prefers to the Spearman coefficient, 
r_s. These matters will not be discussed further here, but 
the reader again is invited to consult Kendall's book on this subject 
if interested. 


PROBLEMS 


1. Solve problem 5 of section 7.4, using the Spearman coefficient, r_s. 

2. Solve problem 6 of section 7.4, using the Spearman coefficient, r_s. 
Ans. r_s = −1. 

3. Compute r_s for the data of problem 9, section 7.4, letting X = 1 for 1929, 
X = 2 for 1930, etc., and setting Y = wage. 

4. Compute the rank-difference coefficient of linear correlation for the pairs 
of observations in Table 7.22. Ans. r_s = .68. 

5. Make up a problem for which r_s = +1, also for r_s = −1, and r_s = +.5. 
Then make up another set for each case different from each of the others. 

6. A sampling study in cereal chemistry gave the following product-moment 
linear correlations: 

Sample 1: n₁ = 44, r₁ = −0.93 
Sample 2: n₂ = 44, r₂ = −0.81. 

Test H₀(ρ₁ = ρ₂) and draw appropriate conclusions. 
Ans. y = 2.40; P ≅ .017; reject H₀. 

7. Referring to problem 6, how small could the sample size become and still 
result in the rejection of H₀ at the 5 per cent point if the r's stayed the same 
size? 

8. If r₁ = −.93, as in problem 7, could r₂ = 

and n₂ have to be if they were equal? Ans. n₁ = n₂ = 225. 


9. It has been stated that each of the ratios (b − β)/s_b and r√(n − 2)/√(1 − r²) 
follows the t-distribution with n − 2 D/F for samples 
with a given n. Show that these two quantities are algebraically identical if 
β = 0; and hence that testing H₀(β = 0) is identical to testing H₀(ρ = 0). 

10. Suppose that the following results were obtained from two samples 
(involving different methods of some sort), each containing 20 observations: 

Method 1: r₁ = .40, H₀(ρ₁ = 0) accepted. 
Method 2: r₂ = .60, H₀(ρ₂ = 0) rejected, P < .01. 

Yet H₀(ρ₁ = ρ₂) is accepted readily because P > .40. 

Explain how such results are not contradictory. Also, determine what sizes n 
must have in order that each of these three hypotheses will be rejected at the 
5 per cent point if the correlations stay as they are. 
Ans. First H₀, n₁ = 25; second H₀, n₂ = 11; third H₀, n₁ = n₂ = 109. 

You are given the following two-variable frequency distribution table as the 
basis for solving problems 11 to 15 below. These data are derived from records 
of heights and weights of 9-year-old Kansas girls in certain schools. These data 
were obtained from the Department of Home Economics, Kansas Agricultural 
Experiment Station, through the courtesy of Dr. Abby Marlatt. 


HEIGHT IN CENTIMETERS (X) 

Weight in     123  127  131  135  139  143  147  151 
kilograms      to   to   to   to   to   to   to   to 
(Y)           126  130  134  138  142  146  150  154 

650-689 1 
610-649 1 
570-609 3 0 
530-569 0 4 
490-529 1 4 2 3 
450-489 1 4 9 4 
410-449 1 3 4 12 6 
370-409 1 5 6 14 13 5 
330-369 1 2 9 23 17 14 4 
290-329 1 1 17 25 32 13 4 
250-289 1 15 32 19 5 
210-249 5 8 5 2 

Mean weight 246.6 259.9 285.4 306.8 336.5 378.7 409.90 458.1 
Standard deviation 29.1 27.2 35.6 38.6 57.0 109.6 82.8 


11. Plot the mean weights above against the midpoints of the height classes, 
and decide therefrom if the assumption of a linear relationship between these 
two variables seems acceptable. 

12. Ignore any indication of non-linearity of trend and compute r and b by 
the methods of this chapter. What conclusions can you draw from these 
estimates? Ans. b = 0.77, r = .72. 

13. Compute the standard deviation not given in the above table by the 
method of Chapter 2 adjusted so as to take account of the fact that this is 
supposed to be a sample. 

14. Each height class has some kind of one-variable frequency distribution 
of the weights within the height class. Hence the above data constitute 
several samples of weights within height classes. Theoretically, these weight 
distributions within height classes must have equal population variances. Plot 
the standard deviations against the midpoints of the corresponding height 
classes and decide therefrom, if you can, whether or not that is a good 
assumption. 


15. For the weight class 290 to 329 kilograms compute the coefficient of 
variation for the heights, taking the point of view maintained in Chapter 2, 

REVIEW PROBLEMS 


1. Define the term percentile and explain how it can be associated with the 
relative cumulative frequency distribution of a group of measurements. 

2. Calculate the arithmetic mean, standard deviation, and coefficient of 
variation for the following data on the carotene content of pasture grasses, in 
milligrams per gram: 

X: 0.22 0.13 0.23 0.36 0.44 0.26 0.11 0.23 0.26 
0.26 0.20 0.16 and 0.20 


3. What is the median carotene content for the data of problem 2? The 
mean deviation? 


4. Compute the geometric mean of 32 and 90. Of 2, 7, and 30. Of the X's 
of problem 2. Ans. 53.8; 7.49; 0.22. 

5. You are given the following information regarding a group of 2000 weights 
(in grams): μ = 800, md = 700, Q₁ = 500, Q₃ = 900, extreme weights are 350 
and 1300; and the upper limits of the 15th, 35th, 80th, and 90th percentiles are 
400, 540, 950, and 1050, respectively. Sketch a graph of the r.c.f. curve. 

6. Given the following scores made on an Ohio Psychological Test, construct 
a frequency distribution with equal class intervals and compute μ and σ. These 
scores are necessarily integers. 


83 69 30 26 53 60 44 36 68 71 55 52 45 62 42 47 
70 62 28 46 42 45 38 45 75 79 73 105 80 81 68 65 
48 52 38 77 26 71 31 24 51 55 67 41 36 67 106 37 
60 48 74 98 62 33 83 108 74 35 38 35 38 112 66 85 
48 44 100 55 77 78 21 94 35 75 71 69 61 50 70 47 
65 103 100 70 60 30 97 86 54 71 87 68 64 54 45 30 
52 49 78 51 91 63 45 46 90 42 68 34 79 76 39 38 
64 46 34 43 57 76 31 60 34 105 17 831 67 73 53 99 
68 54 37 99 48 24 50 58 104 64 54 38 96 53 57 35 
52 73 66 39 59 70 91 88 60 44 82 72 56 76 71 30 
59 50 100 77 129 46 86 88 36 78 61 58 40 37 65 72 
103 63 46 70 48 48 57 83 51 29 51 32 37 100 43 47 
53 41 107 115 64 59 26 48 40 61 37 70 49 62 88 42 
69 49 71 57 87 63 101 69 50 75 69 48 59 49 96 67 
63 71 75 56 78 40 81 59 74 110 57 28 50 68 63 55 
61 30 95 116 75 71 31 34 77 60 84 68 70 36 65 27 
608 49 41 79 66 78 58 99 98 79 89 27 87 37 48 75 
80 109 43 46 91 77 601 44 58 53 45 87 96 64 84 87 
116 35 105 43 75 22 37 49 56 60 74 38 38 28 57 29 
of ;84 .01 27 62 71,53 44 88 76 61 45 45 41 183: 157 
58 83 82 67 75 29 71 77 50 47 102 83 47 64 57 75 
94 38 38 107 65 25 51 28 53 80 79 55 47 57 76 49 
92 32 39 89 70 52 34 41 31 77 57 44 56 41 39 42 
81 70 68 69 80 48 46 38 83 65 33 57 14 42 32 78 
51 55 50 652 75 57 65 74 40 63 44 59 38 00 64 35 
50 65 37 76 82 100 48 69 47 54 33 35 61 74 37 37 
35 42 128 35 47 57 59 46 91 80 81 78 74 53 39 66 
58 63 40 55 46 46 40 38 58 63 32 42 56 30 85 50 
41 74 43 55 93 33 60 72 54 Sl 66 56 36 60 92 39 
31 81 41 38 28 62 51 S6 38 6l 48 85 53 82 26 32 
48 46 40 51 54 28 66 72 48 75 69 69 82 56 30 57 
96 87 63 43 45 38 82 43 62 31 66 80 97 78 36 60 
91 97 59 40 45 78 89 28 67 79 53 82 37 98 56 68 


66 33 36 43 80 72 51 54 30 34 36 77 54 03 66 45 
29 29 59 70 83 45 108 78 37 48 36 33 97 43 58 89 
60 67 55 64 72 99 91 75 46 52 59 39 18 54 91 76 
29 63 95 41 28 45 44 94 57 34 86 36 36 69 55 58 
67 86 82 42 48 62 109 48 Sl 


Ans. μ ≅ 59.7; σ = 21.9. 
7. Construct an r.c.f. curve for the data of problem 6 and obtain from it evi- 
dence regarding the normality of the distribution of these test scores. 
8. In what percentile would a person who made a score of 101 rank in the 
test of problem 6? Ans. 97th. 
9. What are the modal and the median test scores, respectively, for prob- 
lem 6? 
10. Calculate the median for the Ohio test scores of problem 6 by grouping 
them into about 12 classes of equal length. Ans. md = 51.6. 
11. The Year Book of Labour Statistics for 1943-1944 gives the following 
percentages of unemployed in the United States and Sweden during 1941, by 
monthly averages: 

U.S.:    15.3  14.0  12.1   8.8   5.3   2.9   2.2   1.9 
Sweden:  17.1  16.4  15.1  13.1  10.6   9.3   7.8   7.5 
U.S.:     0.5   1.3   3.7   5.3 
Sweden:   7.3   8.2  10.0  13.0 


In which country was the level of unemployment relatively more stable during 
that year? Justify answer statistically. 

12. The USDA publication, Agricultural Statistics, 1946, lists the following 
tax levies for the 48 states, in dollars per acre: 


1.01 0.89 0.59 2.73 1.77 2.19 1.06 2.26 1.05 0.73 
0.76 1.06 0.54 0.96 0.83 1.18 0.33 0.24 0.24 0.37 
0.40 0.33 0.81 0.27 0.16 0.40 0.24 0.18 0.62 0.37 
0.43 0.22 0.45 0.32 0.33 0.25 0.15 0.11 0.52 0.06 
0.20 0.04 0.08 0.35 0.17 0.43 0.27 1.00 


Compute the median tax per acre from an array of these data. Also compute 
the range and the midrange. Compare the latter with the median and draw 
any possible conclusions. Ans. md = .40; range = 2.69; MR = 1.38. 

13. Referring to the data of problem 12, in what decile would a state rank 
if its tax rate were 0.40 dollars per acre? What percentage of the states would 
have a higher rate? 

14. Suppose that a sample of 15 differences in yield between two varieties of 
corn grown side by side on 15 pairs of plots has been found to have an arithmetic 
mean = 10 bushels per acre, with the standard deviation = s = 13 bushels per 
acre. Is one variety superior to the other or not? Give reasons for answer. 
Ans. t ≅ 3.00; 14 D/F, P = .01. 
15. Suppose that a large number of tractor gears has been produced and that 
90 per cent of them are classifiable as acceptable. If a sample of 10 gears is 
taken at random from this group, which of the following is more likely to 
occur? (a) The sample will contain less than 90 per cent “acceptable” and 
will, therefore, give a pessimistic picture of the quality of the whole batch. 
Or (b) the sample will contain at least 90 per cent “acceptable” and hence 
will, if anything, overestimate the quality of the batch. 
16. The depth of deterioration (in inches) is used as an index of the 
merchantable volume of timber remaining in fire-killed Douglas fir. Kimmey and 
Furniss (USDA Tech. Bull. 851) report a study made in western Oregon and 
Washington on such timber. The following data on old-growth trees were read 
from one of their graphs: 

Years after fire:    5    10    15    20    25    30    35    40    45 
Depth in inches:   2.2   3.6   4.8   6.3   7.7   9.1  10.8  12.2  13.8 
Years after fire:   50    55    60 
Depth in inches:  15.5  17.5  19.8 

What do you conclude from a regression analysis is the average increase in 
depth of deterioration per decade? Given ΣXY = 5123.0, ΣY² = 1617.09. 
Ans. 3.0 inches = point estimate; CI₉₅: 0.25 ≤ β ≤ 0.35. 
17. Economists sometimes speak of commodities with elastic or inelastic 
prices, meaning generally that a commodity which is slow to change price in 
the face of changes in demand has an elastic price. If you adjust prices for 
inflation and for depression, and if demand is measured by per capita 
consumption of a given commodity, the definition of an elastic, or inelastic, 
price can be made more specific. For example, if the slope of the linear trend 
line relating adjusted price (Y) to consumption per capita (X) is less than 
unity, the price can be called elastic. If β is greater than 1, the price then 
is called inelastic. Given the following data regarding whole milk and cream, 
would the price be classified as elastic according to the above definition 
after due allowance for sampling error? 

Adjusted price ($/cwt):        1.88  2.06  2.07  2.26  2.36  2.29  2.48  2.19 
Consumption per capita (cwt):  3.43  3.50  3.72  3.93  4.12  4.32  4.20  4.00 


18. In the preceding problem a definition of an elastic price was based upon 
the size of the regression coefficient, β. What information would it add to 
this discussion to include the size of the correlation coefficient? 

19. Following are adjusted farm beef prices (per hundredweight) and 
consumption per capita (hundredweight) for the 10-year period indicated: 

Year:         1937  1938  1939  1940  1941  1942  1943  1944  1945  1946 
Price:        6.67  6.67  7.52  7.79  8.32  8.63  8.62  7.94  8.71  9.06 
Consumption:  0.55  0.54  0.54  0.55  0.60  0.61  0.53  0.55  0.59  0.61 

Compute b and r, describe the price as elastic or inelastic, and bring the size of r 
into your discussion as suggested in problem 18. 
20. Given the following data on turkeys, solve as in problem 19: 

Year:                 1930  1931  1932  1933  1934  1935  1936  1937 
Price (cents/lb) Y:   15.9  18.4  14.8  13.8  16.1  20.1  15.4  17.2 
Consumption (lb) X:    1.8   1.7   2.1   2.4   2.2   2.1   2.7   2.7 

Year:                 1938  1939  1940  1941  1942  1943  1944  1945 
Price (cents/lb) Y:   17.9  16.5  15.9  18.8  22.2  23.6  25.0  24.2 
Consumption (lb) X:    2.7   3.0   3.5   3.5   3.7   3.3   3.3   4.3 

Year:                 1946  1947 
Price (cents/lb) Y:   22.6  18.7 
Consumption (lb) X:    4.5   4.5 

Given: ΣX² = 175.38, ΣXY = 1041.46, Σ(y²) = 203.7361. 
Ans. CI₉₅: 1.0 ≤ β ≤ 8.4, r = .58. 


21. Referring to problems 19 and 20, were the beef or the turkey prices rela- 
tively more stable during the period 1937 to 1946? Give statistical evidence 
for your answer. 

22. The earliness with which chickens obtain their feathers is economically 
important to persons who raise broilers because it affects the rapidity and 
cleanness of dressing. Early feathering, a sex-linked characteristic, is chiefly 
dependent upon one gene locus on the sex chromosome. Its inheritance can 
be described diagrammatically as follows: 

1. L/(none) = late-feathering female, 
2. l/(none) = early-feathering female, 
3. L/L or L/l = late-feathering male, 
4. l/l = early-feathering male. 

If late-feathering females are mated to early-feathering males, what is the 
expected number of: (a) late-feathering females among the offspring, (b) 
late-feathering males, (c) early-feathering chicks of either sex among 1000 
offspring? Ans. (a) none; (b) 500; (c) 500, all females. 

23. Suppose that the late-feathering males in a flock can be assumed to have 
two-thirds of type Ll and one-third of type LL. If these males are mated to 
early-feathering females, what is the probability that a fertile egg selected at 
random will hatch into an early-feathering chick, sex disregarded? 
24. Referring to problem 22, suppose parents 2 and 4 are mated. If 4 fertile 
eggs are to be incubated and all can be assumed to produce a live chick, what 
is the probability that at least one early-feathering chick of each sex will be 
hatched so that you could hope to develop a line of early-feathering chickens? 
Ans. 55/198. 

25. Some educators believe that tests can be developed which measure a 
person's general ability to think critically. No particular field of subject matter 
is involved. The following data are test scores derived from such a testing 
program. The students were asked, but not required, to indicate the level 
of their father's annual income for the preceding year. All samples have n = 20 
and are assumed to be from normal populations. 
Freshmen Juniors and Seniors 
Men Women Men Women 
Q 2) (8) a (23) (3) a) (2) 3) (3 
4l 44 34 43 4l 24 39 36 38 31 
31 42 39 44 36 38 35 42 47 45 
36 31 39 42 20 32 34 32 26 33 
36 31 4l 26 26 35 25 40 34 29 
26 34 36 33 25 35 27 23 45 28 
25 28 35 41 34 20 31 41 24 39 
34 29 36 41 38 26 44 31l 39 25 
24 35 33 28 42 32 45 32 43 41 
28 2 21 35 33 2 48 35 35 44 
29 34 31 28 34 3 39 40 33 18 
23 (88 183 283 28 31 37 36 31 35 
37 27 26 24 34 . 33 29 37 27 46 
35 20 28 30 20 22 36 32 28 38 
32 42 35 36 45 24 35 45 29 49 
33 28 30 38 37 35 41 32 32 30 
35 36 32 38 32 29 35 43 33 42 
30 34 30 31 35 28 26 38 37 40 
49 38 2 45 40 21 39 44 32 37 
38 42 2% 40 24 3 37 35 40 36 
34 26 39 37 34 41 43 46 38 13 
z: 33.0 33.4 32.2 35.4 32.9 


29.9 36.2 37.0 34.6 


(1) = not over $5000. (2) = $5001 to $10,000. (3) = none stated. 
Use any or all of these 


below. 
Do the two samples fo, 


n whose fathers earned not 

00 in y think more critically than 

freshmen men, if this test is assumed to be reliable? 
26. Referring to problem 25, how 

Senior men whose fathers are in the lower incom 


i © group? Show how the G-test 
helps answer this question. Ans. G = 0.13: P Oa aow; the is 


27. Solve as in problem 26 for the higher income group of fathers. 

28. You might wonder if the group which refused (or neglected) to reveal 
the father's income are from a different population as regards scores in critical 
thinking. Make a study of this matter for freshmen women by means of the 
G-test. Ans. Low income vs. undeclared: G = 0.26, P < .010. 
High income vs. undeclared: G = 0.09, P > .10. 

29. Do freshmen women from the higher income group belong to a population 
with a lower mean score than that of the freshmen women from the lower 
income group? Use the G-test. 

30. Solve problem 29, using the t-test instead of the G-test. (See other tables 
for these D/F.) Given: Σx² for (1) = 804.80; for (2) = 953.80. 
Ans. t = 1.16, 38 D/F, P > .05. 

31. Use the G-distribution to set a 90 per cent confidence interval on the true 
mean score of freshmen men whose fathers make $5000 or less per year. Does 
this interval make it possible to test the hypothesis H₀(μ = 0)? 

32. Compute CI₉₀'s on the true means of the populations sampled by columns 
4 and 6 (from left) in the table above, and draw all appropriate conclusions. 
Ans. Col. 4: 33.2 ≤ μ₄ ≤ 37.6. Col. 6: 27.7 ≤ μ₆ ≤ 32.1 by G-distribution. 

The following data record the thiamin-content, in micrograms per gram of 
meat (dry fat-free basis) in raw pork loin after various periods of storage (tem- 
perature not over 10°F). (These data made available through the courtesy of 
Dr. Beulah Westerman, Department of Foods and Nutrition, Kansas State 
College.) 

PERIOD OF STORAGE (Weeks) 

             0       12       24       40       56       72 

        126.88    81.47    91.51   104.78    76.99    93.57 
         98.83    69.14    69.58    69.04    79.22    94.25 
        106.55   119.44    98.17    84.84    74.28   114.03 
         91.73    75.65    81.49   105.20   121.34    99.65 
         68.35    65.41    77.05    70.06    83.58    88.49 
         95.41   111.89   102.43   111.17    97.14    77.19 
        111.67    80.93    91.87    86.30    97.21   116.62 
         78.30    76.94    88.62    71.01    72.91    87.38 
        118.50   111.26   102.31   100.85    65.03    91.94 

Mean     99.58    88.01    89.23    89.25    85.30    95.90 
Range    58.53    54.03    25.38    42.13    56.31    39.43 


33. Compute through the G-distribution the CI₉₀ on the true thiamin 
concentration (micrograms per gram) in raw, unstored, pork loin produced under 
the conditions maintained during the sampling which produced the above data. 
Draw all appropriate conclusions. 

34. Solve as in problem 33 for raw pork loin stored 12 weeks. 
Ans. CI₉₀: 76.9 ≤ μ ≤ 99.1 micrograms per gram. 

35. Does cold storage (at, or below, 10°F) of raw pork loin for 12 weeks 
reduce the thiamin concentration, according to the evidence from the above 
data and the G-test? 

36. Does the concentration of thiamin in raw pork loin increase between the 
fifty-sixth and the seventy-second week of cold storage (at, or below, 10°F), 
or is the observed average increase of 10.60 micrograms per gram probably just 
a sampling accident? Ans. G = 0.221, n = 9, P > .10; sampling accident. 

37. It appears from the sampling data above that the thiamin concentration 
in raw pork loin decreases during the first 12 weeks of storage, stays about the 
same through the fifty-sixth week of such storage, and returns to about the 
original concentration by the end of the seventy-second week of storage. Is 
this actually the case, according to G-tests, or could the observed results rea- 
sonably be assigned to sampling error? 


The following data are from the same source as those immediately above, 
and were taken during the same general experimentation. They record the 
riboflavin concentration in raw pork loin instead of the thiamin content just 
studied. These data are to be employed in the solution of problems 38 through 
43. 


PERIOD OF STORAGE (Weeks) 

            0      12      24      40      56      72 

         3.42    4.31    5.98    5.17    4.08    5.39 
         2.86    3.52    4.84    4.19    3.22    5.02 
         2.99    3.47    5.14    4.87    4.03    5.51 
         2.24    3.47    4.72    4.55    4.19    5.03 
         2.02    3.43    4.52    4.58    3.35    4.25 
         2.17    4.07    2.91    4.28    3.35    4.18 
         1.69    3.52    3.73    4.33    5.23    4.80 
         2.09    3.48    3.61    4.29    4.91    3.51 
         1.57    3.84    3.60    5.14    5.80    4.63 

Mean     2.34    3.68    4.34    4.60    4.24    4.70 
Range    1.85    0.88    3.07    0.98    2.58    2.00 


38. Make a scatter diagram with Y = mean riboflavin concentration and 
X = weeks of storage. Is the trend in the bivariate population of X's and Y's 
probably linear for these times of storage? 

39. Would the above data cause you to accept, or to reject, the hypothesis 
that the riboflavin concentration in raw pork loin is increased by 12 weeks of 
storage at, or below, 10°F? 


40. Can the apparent drop in riboflavin concentration between the fortieth 
and the fifty-sixth weeks of storage reasonably be assigned to sampling 
accidents? Ans. G = 0.202, n = 9, P > .10; accept H₀(μ₄₀ = μ₅₆). 

41. Use the t-distribution to set a 95 per cent confidence interval on the true 
riboflavin concentration in raw, unstored, pork loin of the kind sampled here. 
You are given that ΣX = 21.05, and ΣX² = 52.3121. 

42. Solve problem 41 by means of the G-distribution, and compare the result 
with that obtained from the t-distribution. 
Ans. From G: 1.9 ≤ μ₀ ≤ 2.8; from t: 1.8 ≤ μ₀ ≤ 2.9. 

43. According to the evidence above, 40 weeks might be an optimum storage 
period for increasing riboflavin. The means for 0 and 40 weeks differ by 2.36. 
Use the G-distribution to place a CI₉₅ on the true gain due to 40 weeks of 
storage, and draw conclusions. 

44. A recessive lethal will destroy an organism only if carried by both 
chromosomes of a pair. Suppose that l₁ is such a lethal, and that the following 
mating has been made: L₁l₁ × L₁l₁. What is the probability that among the first 10 
offspring none will be killed by this lethal? Ans. .057. 

45. Suppose that a flock of chickens carries the lethal mentioned in problem 
44, and that the owner wishes to so select his future breeding stock that this 
lethal will disappear as rapidly as possible from his flock. He knows that some 
of his chickens are carriers, that is, are L₁l₁. New stock which he raises cannot 
be designated as L₁L₁ or as L₁l₁ until they have produced some (perhaps many) 
offspring. Hence new members of the flock will be mated to known L₁l₁'s and 
then will be eliminated from the flock if any of their offspring are victims of 
the lethal because this will show that they are carrying that gene. How many 
offspring should the owner see from a chicken without the appearance of the 
lethal before accepting that chicken as being L₁L₁ and hence not a carrier of 
the lethal? Since he never can be absolutely positive, assume that he is willing 
to run a risk of 1 in 50 of reaching such a conclusion erroneously. 

46. Suppose that a trait which is of economic interest to a sheep breeder is 
determined by two genes, R and S, believed to be carried on two different 
chromosomes. It also is believed that R is completely dominant to r and 
similarly for S with respect to s. It is supposed that only those animals showing 
both dominant characteristics are of special interest. If the breeder's hypotheses 
are correct, the mating RrSs X RrSs should produce 9/16 of its offspring with 
both the R and the S genes, 3/16 with R but not S, 3/16 with S but not R, and 
1/16 with neither R nor S. Suppose that all four possibilities are distinguishable 
and that the following offspring have been recorded: 


82 are R and S (called RS); 36 are R but not S (called Rs); 28 are S but not R 
(called rS), and 14 are neither R nor S (called rs). 

Given these results, would you accept the hypothesis stated above, namely, 
H₀(9RS:3Rs:3rS:1rs)? Give reasons. 
Ans. χ² = 3.644, 3 D/F, P > .11; accept H₀. 
47. What is the probability that both of two CI₉₅'s on μ obtained from two 
random samples from the same normal population will include μ? Since the 
μ would lie in the overlap of these two intervals (if both did include μ), and 
since this overlap would be shorter than either interval in many cases, and 
never longer, would you do a better job of estimating μ by using two random 
samples and considering this overlap? Would the probability of an error of 
the first kind be reduced if this process were used to test an H₀? Give reasons 
for answers. 


The following numbers are measurements of basal metabolism (in calories/ 
square meter of surface area/hour), and are to be used in answering problems 
48 to 53 below. These data were derived from measurements provided through 
the courtesy of Mrs. Ada Seymour and the Department of Home Economics, 
Kansas Agricultural Experiment Station. All ages are to the nearest birthday. 


Age Class        n     CI₉₅ on μ      mean, x̄    s_x̄ 
10-11           45    42.96-44.72     43.84     0.441 
12-14           46    37.27-39.33               0.516 
15-16           52                    34.50     0.350 
17              65    33.69-35.11               0.354 
18              90    32.59-33.77               0.295 
19              91    32.05-33.70 
20              73    32.67-34.00               0.333 
21-25          175                    32.82     0.185 
26-29           55    31.66-33.01               0.338 
30-34           73    31.90-32.92     32.41 
35-39           57    32.36-33.82     33.09     0.362 
40-44           53    32.00-33.38               0.346 
45-49           56    30.75-31.97               0.304 
50-59           62    30.67-32.03               0.341 
60 and over     33    30.06-32.12 


48. Fill in the two CI₉₅'s omitted above and state what information they 
yield. 

49. Graph the CI₉₅'s versus age (on the horizontal axis) so as to produce a 
figure from which you could read, approximately, the confidence interval on 
true mean basal metabolism for any age, with a confidence coefficient .95. This 
is to be applied only to Kansans, of course. 

50. Compute the two missing standard deviations in the above table. 

51. Test the hypothesis that Kansas women between the ages of 35 and 39 
have a higher average basal metabolism than those in the age interval from 
30 to 34 years. 

52. According to the "Mayo Foundation Normal Standards," published in 
July of 1936 in the American Journal of Physiology, the mean basal for 17-year- 
old females is 37.82 calories per square meter per hour. According to the table 
above do the Kansas girls fit that norm, or do they probably have a lower 
average metabolism rate? How confident can you be of your answer when 
allowance is made for sampling error in the above table, but none is allowed 
for the Mayo Standard? 

53. Assuming that the records for those persons in the age group 21 to 25 
years are normally distributed, estimate the range for this sample of 175. 

54. Suppose that 147 freshmen, 18 years of age, have taken a test designed 
to measure their ability to think critically, and have taken this test at the 
beginning and also at the end of their freshman year. Their progress during 
the year is measured by the difference between these two scores. Given that 
ΣY = 712 and Σy² = 5567.40, test the hypothesis that freshmen of the sort so 
sampled make some improvement in critical thinking during the year in so 
far as this is measured by the test administered. Consider that t has 30 D/F. 
Ans. t = 9.67, P nearly zero; μ ≠ 0. 

55. Suppose that two varieties of corn have been grown at the same 
experimental farm during the same year, and that the following plot yields, in 
pounds, have been obtained: 

No. 1: 12.1 12.8 15.2 14.0 13.5 13.6 14.3 12.9 13.9 and 14.7 
No. 2: 14.6 12.9 15.6 14.3 14.8 13.4 13.8 15.3 16.0 and 14.5 

These field weights have been corrected for moisture content so that the variety 
yields per acre can be compared directly with these data. Use the G-test to 
test the hypothesis H₀(μ₁ = μ₂), where the μ's are the true means of the 
varieties. 

56. The following data simulate those which might be obtained from an 
experimental comparison of the effectiveness of two fertilizers on the yield of 
orange trees in pounds per tree: 


Nitrogen (N): 74 89 90 72 78 76 84 79 81 76 and 80 
N + Potash: 103 102 97 80 87 92 91 78 83 89 and 92 


The two groups of trees (one for N and the other for N + P) were assumed 
with good reason to be on equivalent areas of land before the two fertilizers 
were applied. Test the hypothesis that the addition of potash does not affect 
yield. Ans. G = 0.488, n = 11, P = .002; reject hypothesis. 

57. Referring to problem 56, use the G-test to place a 92 per cent confidence 
interval on the true difference in average yield produced by adding potash 
under these circumstances, and draw appropriate conclusions. 

58. The following numbers are the pounds of tobacco per acre yielded, on the 
average, in the United States during the years indicated. Make a scatter 
diagram and decide if the trend toward increasing yield can be reasonably 
considered as linear if this is taken to be a sample. 


Year: 1932 1933 1934 1935 1936 1937 1938 1939 
Yield: 725 789 852 905 807 895 866 940 
Year: 1940 1941 1942 1943 1944 1945 1946 1947 
Yield: 1036 966 1023 964 1116 1094 1182 1142 


59. Referring to problem 58, again assume that this is a sample from a 
bivariate population and compute, and interpret, the CI₉₀ on β, the true slope 
of the regression line. 

60. Solve as in problem 59, after substituting ρ, the true coefficient of linear 
correlation, for β. Ans. CI₉₀: .86 ≤ ρ ≤ .98. 

61. The following data are the numbers of sugar-maple trees tapped each 
year and the resulting pounds of sugar and sirup. If these data can be 
regarded as a sample, did the production per tree change during this period in 
any orderly manner; and, if so, how? 


Year:              1929    1930    1931    1932    1933    1934    1935 
Trees (1000's):  12,951  13,158  12,092  12,064  12,009  12,099  12,341 
Pounds (1000's):   3724    5856    3589    3748    3269    3488    4673 

Year:              1936    1937    1938    1939    1940    1941 
Trees:           11,500  11,339  11,380  10,313   9,957   9,785 
Pounds:            3122    3276    3475    2881    3031    2384 

Year:              1942    1943    1944    1945    1946    1947 
Trees:            9,847   9,281   8,081   7,836   8,000   8,508 
Pounds:            3569    3133    3133    1228    1700    2344 


62. The cumulative and r.c.f. distributions given below are those of the sizes 
of peach orchards in the Sandhills of North Carolina during 1946. (Data from 
Technical Bulletin 91, North Carolina Agricultural Experiment Station.) Do 
the sizes of these orchards follow a normal frequency distribution quite well, 
or is their distribution far from normal? 


Number of Trees   Cumula-            Number of Trees   Cumula- 
in Orchard        tive f   r.c.f.    in Orchard        tive f   r.c.f. 
200-299             12      .03      200-1999            121     .37 
200-399             23      .06      200-2999            153     .47 
200-599             47      .13      200-4999            188     .60 
200-799             64      AF       200-9999            225     .79 
200-999             79      .22      all orchards        257    1.00 


63. Referring to problem 62, what is the median size of orchard? The lower 
limit of the second quartile? 


The following scores on certain academic aptitude tests and the student's 
grade point average (GPA) at the end of the indicated year are to be the 
basis for answering the questions in problems 64 to 70, inclusive. These data 
constitute samples from classes taking a natural science comprehensive course 
at Kansas State College. 


FRESHMEN 


ACE-T ACE-L ACE-Q GPA ACE-T ACE-L ACE-Q GPA 


66 33 33 0.11 85 48 37 0.56 
101 57 44 0.96 89 53 36 0.68 
85 50 35 1.33 100 53 47 0.56 
96 56 40 1-1 122 67 55 1.03 
115 66 48 1.30 117 74 43 2.33 
110 74 36 2.06 96 64 32 2.31 
111 70 41 0.06 90 58 32 0.93 
62 39 23 1.50 103 63 40 1.58 
74 48 26 0.77 41 26 15 0.04 
116 85 31 2.64 68 42 26 1.04 
102 62 40 1.14 125 74 51 2.24 
113 69 44 0.81 111 71 40 0.77 
105 62 43 0.48 87 64 23 1.27 
81 49 32 1.22 100 65 35 0.59 
113 61 52 2.59 114 71 43 2.19 
147 101 46 2.54 99 59 40 1.68 
93 58 35 0.97 115 60 55 1.39 
59 31 28 0.50 77 46 31 0.50 
75 52 23 0.83 89 53 36 1.27 

106 63 43 1.24 137 81 56 1.38 
37 27 10 0.41 42 30 12 0.96 
139 72 67 1.97 125 67 38 1.41 
126 80 46 1.83 

JUNIORS 

115 65 50 1.42 132 71 61 1.00 
100 55 45 0.85 109 58 51 1.95 
107 55 52 1.12 129 83 46 1.93 
108 72 36 1.64 87 50 37 1.79 
115 64 51 0.81 80 45 35 1.35 
83 46 37 0.86 110 70 40 2.08 
121 71 50 1.11 96 47 49 1.60 
82 54 28 1.18 122 70 52 1.93 


64. Make scatter diagrams of the total ACE scores (ACE-T's) on the hori- 
zontal axis and the GPA's on the vertical for freshmen and also for juniors, 
using the same coordinate system but different symbols for the two classes. 

65. After solving the preceding problem, explain why you agree or disagree 
with each of the following statements: 


(a) For freshmen, you would expect to find a positive and useful linear cor- 
relation between these two variables; but there also are other important factors 
affecting the grade point average of a college student. 

(b) For the juniors represented by this sample, there is little, or no, relation- 
ship between the ACE-T score and the grade point average. 

(c) The freshmen and the juniors fit the same general relationship between 
ACE-T and GPA; the persons with especially low ACE-T scores simply have 
been eliminated by the time of the junior year. 

(d) Given that for freshmen the linear correlation between GPA and ACE-L 
score is .6, whereas that between GPA and ACE-Q is only .4 for these samples, 
it is concluded that whatever is measured by the L-score definitely is more 
important than whatever is measured by the Q-score. 


66. Make a scatter diagram for the ACE-L scores of freshmen against their 
GPA's. And then do likewise for the juniors, using the same coordinate axes. 
Draw appropriate conclusions. 

67. Solve as in problem 65, parts (a) to (c), but use the results of problem 
66 and change ACE-T to ACE-L wherever used. 

68. Compute the Spearman rank-difference correlation, r_s, for each scatter 
diagram of problem 64. Then consider problem 65 in the light of these 
correlations. Ans. For freshmen, r_s = .59; for juniors, r_s = .14. 

69. Compute the Spearman coefficient of linear correlation for each scatter 
diagram of problem 66 and then solve problem 67 in the light of these results. 

70. Make a scatter diagram for the freshmen and for the juniors, as in prob- 
lem 64, but use the ACE-Q scores. Draw all appropriate conclusions. 
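
For problem 68, the rank-difference computation can be sketched as below for the junior sample. This is an illustration, not the text's own worked solution: it assumes tied scores receive the average of the ranks they occupy, applies the formula r_s = 1 - 6(sum of d^2)/[n(n^2 - 1)] without any tie correction, and reads the garbled GPA entry "1,11" as 1.11. The function names are invented for the example.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rank_difference(x, y):
    """Rank-difference correlation, 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Junior sample from the table above (ACE-T score, grade point average).
junior_ace_t = [115, 100, 107, 108, 115, 83, 121, 82,
                132, 109, 129, 87, 80, 110, 96, 122]
junior_gpa = [1.42, 0.85, 1.12, 1.64, 0.81, 0.86, 1.11, 1.18,
              1.00, 1.95, 1.93, 1.79, 1.35, 2.08, 1.60, 1.93]

r_s = spearman_rank_difference(junior_ace_t, junior_gpa)
print(f"r_s for juniors = {r_s:.2f}")
```

With these conventions the result rounds to the value given in the answer to problem 68 for juniors.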
TABLE I 
SQUARES, SQUARE ROOTS, AND RECIPROCALS 

n √n √(10n) n² 1/n n √n √(10n) n² 1/n 

0.1 0.32 1.00 0.01 10.000 4.6 2.14 6.78 21.16 .217 
0.2 0.45 1.41 0.04 5.000 4.7 2.17 6.86 22.09 .213 
0.3 0.55 1.73 0.09 3.333 4.8 2.19 6.93 23.04 .208 
0.4 0.63 2.00 0.16 2.500 4.9 2.21 7.00 24.01 .204 
0.5 0.71 2.24 0.25 2.000 5.0 2.24 7.07 25.00 .200 
0.6 0.77 2.45 0.36 1.667 5.1 2.26 7.14 26.01 .196 
0.7 0.84 2.65 0.49 1.429 5.2 2.28 7.21 27.04 .192 
0.8 0.89 2.83 0.64 1.250 5.3 2.30 7.28 28.09 .189 
0.9 0.95 3.00 0.81 1.111 5.4 2.32 7.35 29.16 .185 
1.0 1.00 3.16 1.00 1.000 5.5 2.35 7.42 30.25 .182 
1.1 1.05 3.32 1.21 0.909 5.6 2.37 7.48 31.36 .179 
1.2 1.10 3.46 1.44 0.833 5.7 2.39 7.55 32.49 .175 
1.3 1.14 3.61 1.69 0.769 5.8 2.41 7.62 33.64 .172 
1.4 1.18 3.74 1.96 0.714 5.9 2.43 7.68 34.81 .169 
1.5 1.22 3.87 2.25 0.667 6.0 2.45 7.75 36.00 .167 
1.6 1.26 4.00 2.56 0.625 6.1 2.47 7.81 37.21 .164 
1.7 1.30 4.12 2.89 0.588 6.2 2.49 7.87 38.44 .161 
1.8 1.34 4.24 3.24 0.556 6.3 2.51 7.94 39.69 .159 
1.9 1.38 4.36 3.61 0.526 6.4 2.53 8.00 40.96 .156 
2.0 1.41 4.47 4.00 0.500 6.5 2.55 8.06 42.25 .154 
2.1 1.45 4.58 4.41 0.476 6.6 2.57 8.12 43.56 .152 
2.2 1.48 4.69 4.84 0.455 6.7 2.59 8.19 44.89 .149 
2.3 1.52 4.80 5.29 0.435 6.8 2.61 8.25 46.24 .147 
2.4 1.55 4.90 5.76 0.417 6.9 2.63 8.31 47.61 .145 
2.5 1.58 5.00 6.25 0.400 7.0 2.65 8.37 49.00 .143 
2.6 1.61 5.10 6.76 0.385 7.1 2.66 8.43 50.41 .141 
2.7 1.64 5.20 7.29 0.370 7.2 2.68 8.49 51.84 .139 
2.8 1.67 5.29 7.84 0.357 7.3 2.70 8.54 53.29 .137 
2.9 1.70 5.39 8.41 0.345 7.4 2.72 8.60 54.76 .135 
3.0 1.73 5.48 9.00 0.333 7.5 2.74 8.66 56.25 .133 
3.1 1.76 5.57 9.61 0.323 7.6 2.76 8.72 57.76 .132 
3.2 1.79 5.66 10.24 0.312 7.7 2.77 8.77 59.29 .130 
3.3 1.82 5.74 10.89 0.303 7.8 2.79 8.83 60.84 .128 
3.4 1.84 5.83 11.56 0.294 7.9 2.81 8.89 62.41 .127 
3.5 1.87 5.92 12.25 0.286 8.0 2.83 8.94 64.00 .125 
3.6 1.90 6.00 12.96 0.278 8.1 2.85 9.00 65.61 .123 
3.7 1.92 6.08 13.69 0.270 8.2 2.86 9.06 67.24 .122 
3.8 1.95 6.16 14.44 0.263 8.3 2.88 9.11 68.89 .120 
3.9 1.97 6.24 15.21 0.256 8.4 2.90 9.17 70.56 .119 
4.0 2.00 6.32 16.00 0.250 8.5 2.92 9.22 72.25 .118 
4.1 2.02 6.40 16.81 0.244 8.6 2.93 9.27 73.96 .116 
4.2 2.05 6.48 17.64 0.238 8.7 2.95 9.33 75.69 .115 
4.3 2.07 6.56 18.49 0.233 8.8 2.97 9.38 77.44 .114 
4.4 2.10 6.63 19.36 0.227 8.9 2.98 9.43 79.21 .112 
4.5 2.12 6.71 20.25 0.222 9.0 3.00 9.49 81.00 .111 
TABLE I (Continued) 
SQUARES, SQUARE ROOTS, AND RECIPROCALS 

n √n √(10n) n² 1/n n √n √(10n) n² 1/n 
9.1 3.02 9.54 82.81 .110 13.6 3.69 11.66 184.96 .074 
9.2 3.03 9.59 84.64 .109 13.7 3.70 11.70 187.69 .073 
9.3 3.05 9.64 86.49 .108 13.8 3.72 11.75 190.44 .072 
9.4 3.07 9.70 88.36 .106 13.9 3.73 11.79 193.21 .072 
9.5 3.08 9.75 90.25 .105 14.0 3.74 11.83 196.00 .071 
9.6 3.10 9.80 92.16 .104 14.1 3.76 11.87 198.81 .071 
9.7 3.11 9.85 94.09 .103 14.2 3.77 11.92 201.64 .070 
9.8 3.13 9.90 96.04 .102 14.3 3.78 11.96 204.49 .070 
9.9 3.15 9.95 98.01 .101 14.4 3.79 12.00 207.36 .069 
10.0 3.16 10.00 100.00 .100 14.5 3.81 12.04 210.25 .069 
10.1 3.18 10.05 102.01 .099 14.6 3.82 12.08 213.16 .068 
10.2 3.19 10.10 104.04 .098 14.7 3.83 12.12 216.09 .068 
10.3 3.21 10.15 106.09 .097 14.8 3.85 12.17 219.04 .068 
10.4 3.22 10.20 108.16 .096 14.9 3.86 12.21 222.01 .067 
10.5 3.24 10.25 110.25 .095 15.0 3.87 12.25 225.00 .067 
10.6 3.26 10.30 112.36 .094 15.1 3.89 12.29 228.01 .066 
10.7 3.27 10.34 114.49 .093 15.2 3.90 12.33 231.04 .066 
10.8 3.29 10.39 116.64 .093 15.3 3.91 12.37 234.09 .065 
10.9 3.30 10.44 118.81 .092 15.4 3.92 12.41 237.16 .065 
11.0 3.32 10.49 121.00 .091 15.5 3.94 12.45 240.25 .065 
11.1 3.33 10.54 123.21 .090 15.6 3.95 12.49 243.36 .064 
11.2 3.35 10.58 125.44 .089 15.7 3.96 12.53 246.49 .064 
11.3 3.36 10.63 127.69 .088 15.8 3.97 12.57 249.64 .063 
11.4 3.38 10.68 129.96 .088 15.9 3.99 12.61 252.81 .063 
11.5 3.39 10.72 132.25 .087 16.0 4.00 12.65 256.00 .062 
11.6 3.41 10.77 134.56 .086 16.1 4.01 12.69 259.21 .062 
11.7 3.42 10.82 136.89 .085 16.2 4.02 12.73 262.44 .061 
11.8 3.44 10.86 139.24 .085 16.3 4.04 12.77 265.69 .061 
11.9 3.45 10.91 141.61 .084 16.4 4.05 12.81 268.96 .061 
12.0 3.46 10.95 144.00 .083 16.5 4.06 12.85 272.25 .061 
12.1 3.48 11.00 146.41 .083 16.6 4.07 12.88 275.56 .060 
12.2 3.49 11.05 148.84 .082 16.7 4.09 12.92 278.89 .060 
12.3 3.51 11.09 151.29 .081 16.8 4.10 12.96 282.24 .060 
12.4 3.52 11.14 153.76 .081 16.9 4.11 13.00 285.61 .059 
12.5 3.54 11.18 156.25 .080 17.0 4.12 13.04 289.00 .059 
12.6 3.55 11.22 158.76 .079 17.1 4.14 13.08 292.41 .058 
12.7 3.56 11.27 161.29 .079 17.2 4.15 13.11 295.84 .058 
12.8 3.58 11.31 163.84 .078 17.3 4.16 13.15 299.29 .058 
12.9 3.59 11.36 166.41 .078 17.4 4.17 13.19 302.76 .057 
13.0 3.61 11.40 169.00 .077 17.5 4.18 13.23 306.25 .057 
13.1 3.62 11.45 171.61 .076 17.6 4.20 13.27 309.76 .057 
13.2 3.63 11.49 174.24 .076 17.7 4.21 13.30 313.29 .056 
13.3 3.65 11.53 176.89 .075 17.8 4.22 13.34 316.84 .056 
13.4 3.66 11.58 179.56 .075 17.9 4.23 13.38 320.41 .056 
13.5 3.67 11.62 182.25 .074 18.0 4.24 13.42 324.00 .056 
TABLE I (Continued) 
SQUARES, SQUARE ROOTS, AND RECIPROCALS 


n √n √(10n) n² 1/n n √n √(10n) n² 1/n 
18.1 4.25 13.45 327.61 .055 21.6 4.65 14.70 466.56 .046 
18.2 4.27 13.49 331.24 .055 21.7 4.66 14.73 470.89 .046 
18.3 4.28 13.53 334.89 .055 21.8 4.67 14.76 475.24 .046 
18.4 4.29 13.56 338.56 .054 21.9 4.68 14.80 479.61 .046 
18.5 4.30 13.60 342.25 .054 22.0 4.69 14.83 484.00 .045 
18.6 4.31 13.64 345.96 .054 22.1 4.70 14.87 488.41 .045 
18.7 4.32 13.67 349.69 .053 22.2 4.71 14.90 492.84 .045 
18.8 4.34 13.71 353.44 .053 22.3 4.72 14.93 497.29 .045 
18.9 4.35 13.75 357.21 .053 22.4 4.73 14.97 501.76 .045 
19.0 4.36 13.78 361.00 .053 22.5 4.74 15.00 506.25 .044 
19.1 4.37 13.82 364.81 .052 22.6 4.75 15.03 510.76 .044 
19.2 4.38 13.86 368.64 .052 22.7 4.76 15.07 515.29 .044 
19.3 4.39 13.89 372.49 .052 22.8 4.77 15.10 519.84 .044 
19.4 4.40 13.93 376.36 .052 22.9 4.79 15.13 524.41 .044 
19.5 4.42 13.96 380.25 .051 23.0 4.80 15.17 529.00 .043 
19.6 4.43 14.00 384.16 .051 23.1 4.81 15.20 533.61 .043 
19.7 4.44 14.04 388.09 .051 23.2 4.82 15.23 538.24 .043 
19.8 4.45 14.07 392.04 .050 23.3 4.83 15.26 542.89 .043 
19.9 4.46 14.11 396.01 .050 23.4 4.84 15.30 547.56 .043 
20.0 4.47 14.14 400.00 .050 23.5 4.85 15.33 552.25 .043 
20.1 4.48 14.18 404.01 .050 23.6 4.86 15.36 556.96 .042 
20.2 4.49 14.21 408.04 .050 23.7 4.87 15.39 561.69 .042 
20.3 4.51 14.25 412.09 .049 23.8 4.88 15.43 566.44 .042 
20.4 4.52 14.28 416.16 .049 23.9 4.89 15.46 571.21 .042 
20.5 4.53 14.32 420.25 .049 24.0 4.90 15.49 576.00 .042 
20.6 4.54 14.35 424.36 .049 24.1 4.91 15.52 580.81 .041 
20.7 4.55 14.39 428.49 .048 24.2 4.92 15.56 585.64 .041 
20.8 4.56 14.42 432.64 .048 24.3 4.93 15.59 590.49 .041 
20.9 4.57 14.46 436.81 .048 24.4 4.94 15.62 595.36 .041 
21.0 4.58 14.49 441.00 .048 24.5 4.95 15.65 600.25 .041 
21.1 4.59 14.53 445.21 .047 24.6 4.96 15.68 605.16 .041 
21.2 4.60 14.56 449.44 .047 24.7 4.97 15.72 610.09 .040 
21.3 4.62 14.59 453.69 .047 24.8 4.98 15.75 615.04 .040 
21.4 4.63 14.63 457.96 .047 24.9 4.99 15.78 620.01 .040 
21.5 4.64 14.66 462.25 .047 25.0 5.00 15.81 625.00 .040 
TABLE II 
MANTISSAS FOR COMMON LOGARITHMS 

N     2     3     5     6     7     8 
10  0086  0128  0212  0253  0294  0334 
11  0492  0531  0607  0645  0682  0719 
12  0864  0899  0969  1004  1038  1072 
13  1206  1239  1303  1335  1367  1399 
14  1523  1553  1614  1644  1673  1703 
15  1818  1847  1903  1931  1959  1987 
16  2095  2122  2175  2201  2227  2253 
17  2355  2380  2430  2455  2480  2504 
18  2601  2625  2672  2695  2718  2742 
19  2833  2856  2900  2923  2945  2967 
20  3054  3075  3118  3139  3160  3181 
21  3263  3284  3324  3345  3365  3385 
22  3464  3483  3522  3541  3560  3579 
23  3655  3674  3711  3729  3747  3766 
24  3838  3856  3892  3909  3927  3945 
25  4014  4031  4065  4082  4099  4116 
26  4183  4200  4232  4249  4265  4281 
27  4346  4362  4393  4409  4425  4440 
28  4502  4518  4548  4564  4579  4594 
29  4654  4669  4698  4713  4728  4742 
30  4800  4814  4843  4857  4871  4886 
31  4942  4955  4983  4997  5011  5024 
32  5079  5092  5119  5132  5145  5159 
33  5211  5224  5250  5263  5276  5289 
34  5340  5353  5378  5391  5403  5416 
35  5465  5478  5502  5514  5527  5539 
36  5587  5599  5623  5635  5647  5658 
37  5705  5717  5740  5752  5763  5775 
38  5821  5832  5855  5866  5877  5888 
39  5933  5944  5966  5977  5988  5999 
40  6042  6053  6075  6085  6096  6107 
41  6149  6160  6180  6191  6201  6212 
42  6253  6263  6284  6294  6304  6314 
43  6355  6365  6385  6395  6405  6415 
44  6454  6464  6484  6493  6503  6513 
45  6551  6561  6580  6590  6599  6609 
46  6646  6656  6675  6684  6693  6702 
47  6739  6749  6767  6776  6785  6794 
48  6830  6839  6857  6866  6875  6884 
49  6920  6928  6946  6955  6964  6972 
50  7007  7016  7033  7042  7050  7059 
51  7093  7101  7118  7126  7135  7143 
52  7177  7185  7202  7210  7218  7226 
53  7259  7267  7284  7292  7300  7308 
54  7340  7348  7364  7372  7380  7388 
TABLE II (Continued) 
MANTISSAS FOR COMMON LOGARITHMS 

N     1     2     3     4     7 
55  7412  7419  7427  7435  7459 
56  7490  7497  7505  7513  7536 
57  7566  7574  7582  7589  7612 
58  7642  7649  7657  7664  7686 
59  7716  7723  7731  7738  7760 
60  7789  7796  7803  7810  7832 
61  7860  7868  7875  7882  7903 
62  7931  7938  7945  7952  7973 
63  8000  8007  8014  8021  8041 
64  8069  8075  8082  8089  8109 
65  8136  8142  8149  8156  8176 
66  8202  8209  8215  8222  8241 
67  8267  8274  8280  8287  8306 
68  8331  8338  8344  8351  8370 
69  8395  8401  8407  8414  8432 
70  8457  8463  8470  8476  8494 
71  8519  8525  8531  8537  8555 
72  8579  8585  8591  8597  8615 
73  8639  8645  8651  8657  8675 
74  8698  8704  8710  8716  8733 
75  8756  8762  8768  8774  8791 
76  8814  8820  8825  8831  8848 
77  8871  8876  8882  8887  8904 
78  8927  8932  8938  8943  8960 
79  8982  8987  8993  8998  9015 
80  9036  9042  9047  9053  9069 
81  9090  9096  9101  9106  9122 
82  9143  9149  9154  9159  9175 
83  9196  9201  9206  9212  9227 
84  9248  9253  9258  9263  9279 
85  9299  9304  9309  9315  9330 
86  9350  9355  9360  9365  9380 
87  9400  9405  9410  9415  9430 
88  9450  9455  9460  9465  9479 
89  9499  9504  9509  9513  9528 
90  9547  9552  9557  9562  9576 
91  9595  9600  9605  9609  9624 
92  9643  9647  9652  9657  9671 
93  9689  9694  9699  9703  9717 
94  9736  9741  9745  9750  9763 
95  9782  9786  9791  9795  9809 
96  9827  9832  9836  9841  9854 
97  9872  9877  9881  9886  9899 
98  9917  9921  9926  9930  9943 
99  9961  9965  9969  9974  9987 
TABLE III 
FREQUENCY AND RELATIVE CUMULATIVE FREQUENCY DISTRIBUTIONS FOR THE 
STANDARD NORMAL POPULATION GIVEN FOR THE ABSCISSAS FROM λ = -3.00 
TO λ = +3.00 
Abscissas Ordinates   Abscissas Ordinates   Abscissas Ordinates 

  λ      y   r.c.f.     λ      y   r.c.f.     λ      y   r.c.f. 
-3.00  .004  .001    -1.14  .208  .127    -0.30  .381  .382 
-2.90  .006  .002    -1.12  .213  .131    -0.28  .384  .390 
-2.80  .008  .003    -1.10  .218  .136    -0.26  .386  .397 
-2.70  .010  .003    -1.08  .223  .140    -0.24  .388  .405 
-2.60  .014  .005    -1.06  .227  .145    -0.22  .389  .413 
-2.50  .018  .006    -1.04  .232  .149    -0.20  .391  .421 
-2.40  .022  .008    -1.02  .237  .154    -0.18  .393  .429 
-2.30  .028  .011    -1.00  .241  .159    -0.16  .394  .436 
-2.25  .032  .012    -0.98  .247  .164    -0.14  .395  .444 
-2.20  .035  .014    -0.96  .252  .169    -0.12  .396  .452 
-2.15  .040  .016    -0.94  .256  .174    -0.10  .397  .460 
-2.10  .044  .018    -0.92  .261  .179    -0.08  .398  .468 
-2.05  .049  .020    -0.90  .266  .184    -0.06  .398  .476 
-2.00  .054  .023    -0.88  .271  .189    -0.04  .399  .484 
-1.95  .060  .026    -0.86  .276  .195    -0.02  .399  .492 
-1.90  .066  .029    -0.84  .280  .200     0.00  .399  .500 
-1.85  .072  .032    -0.82  .285  .206    +0.02  .399  .508 
-1.80  .079  .036    -0.80  .290  .212     0.04  .399  .516 
-1.75  .086  .040    -0.78  .294  .218     0.06  .398  .524 
-1.70  .094  .045    -0.76  .299  .224     0.08  .398  .532 
-1.66  .101  .048    -0.74  .303  .230     0.10  .397  .540 
-1.62  .107  .053    -0.72  .308  .236     0.12  .396  .548 
-1.58  .114  .057    -0.70  .312  .242     0.14  .395  .556 
-1.54  .122  .062    -0.68  .317  .248     0.16  .394  .564 
-1.50  .130  .067    -0.66  .321  .255     0.18  .393  .571 
-1.48  .133  .069    -0.64  .325  .261     0.20  .391  .579 
-1.46  .137  .072    -0.62  .329  .268     0.22  .389  .587 
-1.44  .141  .075    -0.60  .333  .274     0.24  .388  .595 
-1.42  .146  .078    -0.58  .337  .281     0.26  .386  .603 
-1.40  .150  .081    -0.56  .341  .288     0.28  .384  .610 
-1.38  .154  .084    -0.54  .345  .295     0.30  .381  .618 
-1.36  .158  .087    -0.52  .348  .302     0.32  .379  .626 
-1.34  .163  .090    -0.50  .352  .309     0.34  .377  .633 
-1.32  .167  .093    -0.48  .356  .316     0.36  .374  .641 
-1.30  .171  .097    -0.46  .359  .323     0.38  .371  .648 
-1.28  .176  .100    -0.44  .362  .330     0.40  .368  .655 
-1.26  .180  .104    -0.42  .365  .337     0.42  .365  .663 
-1.24  .185  .107    -0.40  .368  .345     0.44  .362  .670 
-1.22  .190  .111    -0.38  .371  .352     0.46  .359  .677 
-1.20  .194  .115    -0.36  .374  .359     0.48  .356  .684 
-1.18  .199  .119    -0.34  .377  .367     0.50  .352  .691 
-1.16  .204  .123    -0.32  .379  .374     0.52  .348  .698 
TABLE III (Continued) 

FREQUENCY AND RELATIVE CUMULATIVE FREQUENCY DISTRIBUTIONS FOR THE 
STANDARD NORMAL POPULATION GIVEN FOR THE ABSCISSAS FROM λ = -3.00 
TO λ = +3.00 
Abscissas Ordinates   Abscissas Ordinates   Abscissas Ordinates 

  λ      y   r.c.f.     λ      y   r.c.f.     λ      y   r.c.f. 
 0.54  .345  .705     1.04  .232  .851     1.54  .122  .938 
 0.56  .341  .712     1.06  .227  .855     1.58  .114  .943 
 0.58  .337  .719     1.08  .223  .860     1.62  .107  .947 
 0.60  .333  .726     1.10  .218  .864     1.66  .101  .952 
 0.62  .329  .732     1.12  .213  .869     1.70  .094  .955 
 0.64  .325  .739     1.14  .208  .873     1.75  .086  .960 
 0.66  .321  .745     1.16  .204  .877     1.80  .079  .964 
 0.68  .317  .752     1.18  .199  .881     1.85  .072  .968 
 0.70  .312  .758     1.20  .194  .885     1.90  .066  .971 
 0.72  .308  .764     1.22  .190  .888     1.95  .060  .974 
 0.74  .303  .770     1.24  .185  .893     2.00  .054  .977 
 0.76  .299  .776     1.26  .180  .896     2.05  .049  .980 
 0.78  .294  .782     1.28  .176  .900     2.10  .044  .982 
 0.80  .290  .788     1.30  .171  .903     2.15  .040  .984 
 0.82  .285  .794     1.32  .167  .907     2.20  .035  .986 
 0.84  .280  .800     1.34  .163  .910     2.25  .032  .988 
 0.86  .276  .805     1.36  .158  .913     2.30  .028  .989 
 0.88  .271  .811     1.38  .154  .916     2.40  .022  .992 
 0.90  .266  .816     1.40  .150  .919     2.50  .018  .994 
 0.92  .261  .821     1.42  .146  .922     2.60  .014  .995 
 0.94  .256  .826     1.44  .141  .925     2.70  .010  .997 
 0.96  .252  .831     1.46  .137  .928     2.80  .008  .997 
 0.98  .247  .836     1.48  .133  .931     2.90  .006  .998 
 1.00  .241  .841     1.50  .130  .933     3.00  .004  .999 
 1.02  .237  .846 
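
The two columns of Table III can be regenerated from the standard normal formulas. The sketch below is one possible way to do so and is not taken from the text: the function names are invented, and the r.c.f. is computed through the error function in Python's standard library.

```python
import math


def ordinate(lam):
    """Height y of the standard normal frequency curve at abscissa lam."""
    return math.exp(-lam * lam / 2) / math.sqrt(2 * math.pi)


def rcf(lam):
    """Relative cumulative frequency: proportion of the population below lam."""
    return 0.5 * (1 + math.erf(lam / math.sqrt(2)))


# Print a few rows in the layout of the table.
for lam in (-2.00, -1.00, 0.00, 0.50, 1.00, 2.00):
    print(f"{lam:+.2f}  {ordinate(lam):.3f}  {rcf(lam):.3f}")
```

Because the curve is symmetric about zero, rcf(-λ) = 1 - rcf(λ), which gives a quick check on any pair of tabled entries.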
TABLE IV 


RELATIVE CUMULATIVE FREQUENCY DISTRIBUTION OF t SHOWING THE 
PROPORTIONS OF ALL SAMPLING t_i WITH THE SAME DEGREES OF FREEDOM WHICH 
ARE LESS THAN THE t SHOWN IN COLUMN 1 ON THE LEFT 


Degrees of Freedom 
.988 
991 
995 
997 
.998 
-999 
.999 1.000 


9 


.000 
.001 
.001 
.002 
.004 
.007 
.010 
.014 
.020 
.028 
.038 
.053 
.072 
.098 
.130 
.172 
.222 
.282 
.349 
.423 
-500 
577 
-651 
.718 
778 
.828 
.870 
-902 
.928 
947 
962 
972 
-980 
986 
-990 
.993 
-996 
.998 
.999 
-999 


¿999 


‘999 


12 


.999 


1.000 1.000 1.000 


14 


-000 
001 
.002 
-005 
-007 
.010 
015 
.023 
.033 
.047 
066 
.092 
.125 
.167 
219 
.279 
.348 
.422 
.500 
578 
.652 
.721 
-781 
833 
875 
908 
-934 
-953 
-967 
977 
985 
-990 
-993 
995 
-998 


.999 


16 


1099 


18 


.099 


20 


-000 
001 
001 
004 
.006 
.009 
.013 
.020 
.030 
.043 
.063 
.088 
.122 
165 
217 
278 
.347 
.422 
.500 
578 
653 
422 
-783 
835 
878 
.912 


.937 


.957 
.970 
.980 
987 
991 
994 
-996 
-999 


.999 


1.000 1.000 1.000 1.000 


22 


26 


30 


.007 


SOME FREQUENTLY USED t's CORRESPONDING TO PRE-ASSIGNED PROBABILITIES 
OF OCCURRENCE DURING RANDOM SAMPLING 


Degrees of Freedom 


P       8     9    10    11    12    14    16    18    20    22    26    30 
.100  1.86  1.83  1.81  1.80  1.78  1.76  1.75  1.73  1.72  1.72  1.71  1.70 
.050  2.31  2.26  2.23  2.20  2.18  2.14  2.12  2.10  2.09  2.07  2.06  2.04 
.010  3.36  3.25  3.17  3.11  3.06  2.98  2.92  2.88  2.84  2.82  2.78  2.75 
.001  5.04  4.78  4.59  4.44  4.32  4.14  4.02  3.92  3.85  3.79  3.71  3.65 




TABLE V 


RELATIVE CUMULATIVE FREQUENCY DISTRIBUTION OF χ² SHOWING PROPORTION 
OF ALL SAMPLING χ² WITH SAME DEGREES OF FREEDOM WHICH ARE 
GREATER THAN THE χ² SHOWN ON THE LEFT 


        Degrees of Freedom              Degrees of Freedom 

 χ²      1      2      3         χ²      1      2      3 

0.40    .53                     6.00    .014   .050   .11 
0.50    .48                     6.25    .012   .044   .10 
0.60    .44                     6.50    .011   .039   .090 
0.70    .40                     6.75    .009   .034   .082 
0.80    .37                     7.00    .008   .030   .072 
0.90    .34                     7.50    .006   .023   .057 
1.00    .32                     8.00    .005   .018   .046 
1.25    .26    .53              8.50    .003   .014   .036 
1.50    .22    .47              9.00    .003   .011   .029 
1.75    .19    .42              9.50    .002   .009   .024 
2.00    .16    .37             10.00    .002   .007   .019 
2.25    .13    .33    .50      10.50    .001   .005   .015 
2.50    .11    .29    .47      11.00    .000   .004   .012 
2.75    .10    .25    .43      11.50           .003   .010 
3.00    .084   .22    .39      12.00           .002   .007 
3.25    .071   .20    .35      12.50           .002   .006 
3.50    .061   .18    .32      13.00           .002   .005 
3.75    .053   .15    .29      13.50           .001   .004 
4.00    .046   .13    .26      14.00           .001   .003 
4.25    .039   .12    .24      14.50           .001   .002 
4.50    .034   .10    .21      15.00           .001   .002 
4.75    .029   .092   .19      15.50           .000   .002 
5.00    .025   .082   .17      16.00                  .001 
5.25    .022   .073   .15      17.00                  .001 
5.50    .019   .064   .14      18.00                  .000 
5.75    .016   .057   .12 
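
Entries of Table V can be checked from closed-form expressions for the upper tail of χ² with 1, 2, and 3 degrees of freedom. The sketch below is an illustration, not the method used to construct the table: the function name is invented, and the formulas are the standard ones written with the complementary error function from Python's standard library.

```python
import math


def chi_square_upper_tail(x, df):
    """Proportion of sampling chi-squares greater than x, for df = 1, 2, or 3."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("this sketch covers only 1, 2, or 3 degrees of freedom")


# A few rows in the layout of the table.
for x in (1.00, 2.00, 3.00, 6.00):
    tails = [chi_square_upper_tail(x, df) for df in (1, 2, 3)]
    print(f"{x:5.2f}  " + "  ".join(f"{p:.3f}" for p in tails))
```

Small differences from the printed entries in the last decimal place are to be expected, since the table was prepared by hand.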


TABLE VI 

VALUES OF THE FUNCTION y = (1/√(2π))·e^(-w) 

w y w y w y 
0 .399 1.8 .066 4.2 .006 
0.1 .361 1.9 .060 4.4 .005 
0.2 .327 2.0 .054 4.6 .004 
0.3 .295 2.1 .049 4.8 .003 
0.4 .267 2.2 .044 5.0 .003 
0.5 .242 2.3 .040 5.2 .002 
0.6 .219 2.4 .036 5.4 .002 
0.7 .198 2.5 .033 5.6 .001 
0.8 .179 2.6 .030 5.8 .001 
0.9 .162 2.7 .027 6.0 .001 
1.0 .147 2.8 .024 6.2 .001 
1.1 .133 2.9 .022 6.4 .001 
1.2 .120 3.0 .020 6.6 .001 
1.3 .109 3.2 .016 6.8 .000 
1.4 .099 3.4 .013 
1.5 .089 3.6 .011 
1.6 .081 3.8 .009 
1.7 .073 4.0 .007 
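
The tabled values agree with y = (1/√(2π))·e^(-w), which suggests the following short check. It is a sketch only; the function name is invented for the example.

```python
import math


def table_vi_value(w):
    """y = (1 / sqrt(2*pi)) * e**(-w), the function tabulated in Table VI."""
    return math.exp(-w) / math.sqrt(2 * math.pi)


for w in (0, 0.5, 1.0, 2.0, 3.0):
    print(f"{w:3.1f}  {table_vi_value(w):.3f}")
```

Setting w = λ²/2 gives the ordinate of the standard normal curve at abscissa λ.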


TABLE VII 


BINOMIAL COEFFICIENTS: C = n!/[r!(n - r)!] 


(Table of coefficients for n = 1 to 20 not reproduced.) 
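
The binomial coefficients n!/[r!(n - r)!] for n up to 20 can be generated directly. The sketch below assumes Python 3.8 or later, where `math.comb` is available; the helper name `binomial_row` is invented for the example.

```python
import math


def binomial_row(n):
    """Coefficients n!/[r!(n - r)!] for r = 0, 1, ..., n."""
    return [math.comb(n, r) for r in range(n + 1)]


# Print the first few rows; extend the range to 21 for the full table.
for n in range(1, 7):
    print(n, binomial_row(n))
```

Each row sums to 2^n, which offers a simple check on any row that is copied by hand.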




TABLE VIII 


FACTORIALS AND THEIR LOGARITHMS 


n   n!   log n! 

(Table body not reproduced.) 
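
Factorials and their common logarithms can be computed as needed. The sketch below is an illustration with an invented helper name; it relies only on Python's standard library, whose `math.log10` accepts arbitrarily large integers.

```python
import math


def factorial_and_log(n):
    """Return n! and its common (base 10) logarithm."""
    f = math.factorial(n)
    return f, math.log10(f)


for n in (5, 10, 20):
    f, log_f = factorial_and_log(n)
    print(f"{n:3d}  {f:.5e}  {log_f:.4f}")
```

The logarithms are convenient when a ratio such as n!/[r!(n - r)!] is too large to handle directly, since the ratio becomes a difference of logarithms.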




TABLE IX 


PROBABILITY DISTRIBUTION OF G = |x̄ - μ|/(RANGE) FOR A SAMPLE OF SIZE 
n FROM A NORMAL POPULATION 


          Probability that G will be greater than table value 
Sample 
Size, n   .100    .050    .020    .010    .002    .001 

2 3.157 6.353 15.910 31.828 159.16 318.31 
3 0.885 1.304 2.111 3.008 6.77 9.58 
4 0.529 0.717 1.023 1.316 2.29 2.85+ 
5 0.388 0.507 0.685+ 0.843 1.32 1.58 
6 0.312 0.399 0.523 0.628 0.92 1.07 
7 0.263 0.333 0.429 0.507 0.71 0.82 
8 0.230 0.288 0.366 0.429 0.59 0.67 
9 0.205 0.255 0.322 0.374 0.50 0.57 
10 0.186 0.230 0.288 0.333 0.44 0.50 
11 0.170 0.210 0.262 0.302 0.40 0.44 
12 0.158 0.194 0.241 0.277 0.36 0.40 
13 0.147 0.181 0.224 0.256 0.33 0.37 
14 0.138 0.170 0.209 0.239 0.31 0.34 
15 0.131 0.160 0.197 0.224 0.29 0.32 
16 0.124 0.151 0.186 0.212 0.27 0.30 
17 0.118 0.144 0.177 0.201 0.26 0.28 
18 0.113 0.137 0.168 0.191 0.24 0.26 
19 0.108 0.131 0.161 0.182 0.23 0.25+ 
20 0.104 0.126 0.154 0.175- 0.22 0.24 


The above table was derived from Table 9, page 66, vol. 34, Biometrika, in 
an article by E. Lord, with the permission of the publishers of Biometrika. 
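
Use of Table IX can be sketched as follows. The sample shown is hypothetical and is not data from the text, the dictionary holds only a few rows copied from the table, and the function names are invented for the example.

```python
# Upper-tail points of G copied from Table IX for a few sample sizes.
# Outer keys are sample sizes n; inner keys are P(G > tabled value).
TABLE_IX = {
    8:  {0.100: 0.230, 0.050: 0.288, 0.020: 0.366, 0.010: 0.429},
    9:  {0.100: 0.205, 0.050: 0.255, 0.020: 0.322, 0.010: 0.374},
    10: {0.100: 0.186, 0.050: 0.230, 0.020: 0.288, 0.010: 0.333},
    11: {0.100: 0.170, 0.050: 0.210, 0.020: 0.262, 0.010: 0.302},
    12: {0.100: 0.158, 0.050: 0.194, 0.020: 0.241, 0.010: 0.277},
}


def g_statistic(sample, mu0):
    """G = |sample mean - hypothetical mean| / (sample range)."""
    mean = sum(sample) / len(sample)
    sample_range = max(sample) - min(sample)
    return abs(mean - mu0) / sample_range


def rejects(sample, mu0, level):
    """True if the observed G exceeds the tabled value at the given level."""
    return g_statistic(sample, mu0) > TABLE_IX[len(sample)][level]


# Hypothetical sample of n = 10 observations; test the hypothesis mu = 3.
sample = [2, 4, 3, 5, 6, 4, 3, 5, 4, 4]
g = g_statistic(sample, 3)
print(f"G = {g:.3f}")
print("reject at .050:", rejects(sample, 3, 0.050))
print("reject at .020:", rejects(sample, 3, 0.020))
```

For paired data, as in problem 56, the same computation would be applied to the n differences with the hypothetical mean difference set to zero.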
TABLE X 


PROBABILITY DISTRIBUTION OF G = |x̄1 - x̄2|/(MEAN RANGE) FOR TWO 
SAMPLES EACH OF SIZE n FROM THE SAME NORMAL POPULATION 


          Probability that G will be greater than tabular value 
Sample 
Size, n   .100    .050    .020    .010    .002    .001 
2 2.322 3.427 5.553 7.916 17.81 25.23 
3 0.974 1.272 1.715- 2.093 3.27 4.18 
4 0.644 0.813 1.047 1.237 1.74 1.99 
5 0.493 0.613 0.772 0.896 1.21 1.35+ 
6 0.405 0.499 0.621 0.714 0.94 1.03 
7 0.347 0.426 0.525+ 0.600 0.77 0.85- 
8 0.306 0.373 0.459 0.521 0.67 0.73 
9 0.275- 0.334 0.409 0.464 0.59 0.64 
10 0.250 0.304 0.371 0.419 0.53 0.58 
11 0.233 0.280 0.340 0.384 0.48 0.52 
12 0.214 0.260 0.315+ 0.355+ 0.44 0.48 
13 0.201 0.243 0.294 0.331 0.41 0.45- 
14 0.189 0.228 0.276 0.311 0.39 0.42 
15 0.179 0.216 0.261 0.293 0.36 0.39 
16 0.170 0.205- 0.247 0.278 0.34 0.37 
17 0.162 0.195-- 0.236 0.264 0.33 0.35+ 
18 0.155+ 0.187 0.225+ 0.252 0.31 0.34 
19 0.149 0.179 0.216 0.242 0.30 0.32 
20 0.143 0.172 0.207 0.232 0.29 0.31 


The above table was derived from the same article as Table IX, also with 
the permission of the publishers of Biometrika. 
Index 


A-B blood groups, 56 
ACE test scores, 10, 11, 16, 23, 25, 41 
Addition theorem, for the chi-square 
distribution, 141 
for probabilities, 53 
Approximation of binomial distribution 
by normal, 87, 101 
Arley, Niels, 75, 242 
Atkeson, F. W., 140 
Averages, arithmetic mean, 13 
geometric mean, 20 
harmonic mean, 21 
median, 19 
midrange, 12 
mode, 20 
properties of, 12 


Barmore, Mark, 221 
Bartley, E. E., 181 
Batting averages, 47, 93 
Binomial coefficients, 65, 254 
Binomial frequency distribution, 77, 79 
Binomial probability function, 68 
Bivariate population, 198 
Blood factors, A-B, 56 

M-N, 60 

P, 60 

Rh, 60 
Buch, K. Rander, 75, 242 


Central 95 per cent, 124, 125, 126 
Chi-square, empirical distribution for 1 
D/F, 133 
probability distribution table for 1, 2, 
and 3 D/F, 252 
use in testing H0(p = p0), 132 
use in testing H0(p1 = p2), 136 
use in testing H0(p1:p2:p3 = 
p1':p2':p3'), 144 
use with contingency tables, 143 
CI95, 124 


Classes of events, 51 
Class interval, length of, 23, 28 
midpoint of, 28 
Clopper, C. J., 126 
Coefficient of correlation, product-mo- 
ment, 217, 224 
rank, 227 
Coefficient of linear regression, 199, 224 
Compound probability law, 55 
Confidence coefficient, 122, 165, 214 
Confidence interval, based on G-distri- 
bution, 183 
length of, 169 
observed examples of, 166 
on the correlation coefficient, p, 220 
on the mean, μ, 165 
on μ1 - μ2, 179 
on μ_y·x, 213 
On py-zx;, 214 
on the proportion, p, in a binomial 
population, 124, 127 
on the regression coefficient, β, 211 
Contingency tables, 136, 143 
Control limits, 148 
Correlation coefficient, computation of, 
217, 224 
hypotheses about, 218 
product-moment, 217 
rank, 227 
Cumulative distribution curve, 23, 25 
Cumulative frequency table, 23, 25 
Curvilinear trends, 196, 197 


Decisions based on samples, 114 
Degrees of freedom (D/F), for chi- 
square, 134, 142 
for estimate of correlation, 209, 218 
for estimate of regression, 209, 211 
for estimate of standard deviation of 
regression, b, 211 
for estimate of standard deviation of X, 162 
for estimate of variance about trend 
line, 210 
for t-test, 163, 178, 211, 218 
Deviations from the mean, 14 
Dice, classes of events, 52 
Difference between two means, 176, 179 
Distribution, binomial frequency, graph, 
78 
cumulative, 23, 25 
normal frequency, curve, 91, 93 
normal frequency, formula, 90, 92 
normal probability, curve, 96 
normal probability, table, 249 
of chi-square, 133, 252 
of correlation, r, 218 
of difference between means, x̄1 - x̄2, 
176 
of mean, x̄, 155, 157, 158 
of t, 162, 251 
of z = (1/2) log_e [(1 + r)/(1 - r)], 219 
standard normal, 92, 249 
Dixon, Wilfrid J., 48, 112, 152, 191, 242 
Dot (Scatter) diagram, 194 


Efficient estimates, 123, 160 
Elastic prices, 232 
Empirical distribution, of chi-square, 133 
of correlation coefficient, 218 
of difference between sample means, 
176 
of sample means, 155, 157 
of t, 9 D/F, 162 
of z, 219 
Error of the first kind, 116 
Error of the second kind, 116 
Estimated average Y for a given X, 204, 
206 
Estimation, of the linear correlation coefficient, 217 
of the linear regression coefficient, 
205, 211 
of the mean of a normal population, 
160 
of the mean Y for a given X, 204, 212 
of the percentage, p, for a binomial 
population, 122 
of the standard deviation, about the 
linear trend line, 210 
of the regression coefficient, b, 211 
of the sample mean, 161 
of X, 160 
point, 122 
unbiased, 122 
Events, dependent, 55 
exhaustive set of, 51 
independent, 54 
mutually exclusive, 53 